🚀 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. Compared with Idefics1, it is significantly better at optical character recognition (OCR), document understanding, and visual reasoning.
🚀 Quick Start
Environment Setup
Before using Idefics2, install the required libraries:
pip install transformers requests torch pillow
Code Examples
Below are text-generation examples using `idefics2-8b-base` and `idefics2-8b`:
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda:0"
# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
Example with `idefics2-8b-base`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)
# Create inputs
prompts = [
"<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
"In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of Chicago, and more specifically the skyscrapers of the city.', 'In which city is that bridge located? The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and']
Example with `idefics2-8b` and `idefics2-8b-chatty`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
).to(DEVICE)
# Create inputs
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What do we see in this image?"},
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
]
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "And how about this image?"},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']
Text Generation Inference
Idefics2 is integrated into TGI, and API endpoints are available for `idefics2-8b` and `idefics2-8b-chatty`.
from text_generation import Client
API_TOKEN="<YOUR_API_TOKEN>"
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics2-8b-chatty"
# System prompt used in the playground for `idefics2-8b-chatty`
SYSTEM_PROMPT = "System: The following is a conversation between Idefics2, a highly knowledgeable and intelligent visual AI assistant created by Hugging Face, referred to as Assistant, and a human user called User. In the following interactions, User and Assistant will converse in natural language, and Assistant will do its best to answer User’s questions. Assistant has the ability to perceive images and reason about them, but it cannot generate images. Assistant was built to be respectful, polite and inclusive. It knows a lot, and always tells the truth. When prompted with an image, it does not make up facts.<end_of_utterance>\nAssistant: Hello, I'm Idefics2, Huggingface's latest multimodal assistant. How can I help you?<end_of_utterance>\n"
# Images are passed to TGI inside the prompt using markdown image syntax; <image_url> is a placeholder
QUERY = "User:![](<image_url>)Describe this image.<end_of_utterance>\nAssistant:"
client = Client(
base_url=API_URL,
headers={"x-use-cache": "0", "Authorization": f"Bearer {API_TOKEN}"},
)
generation_args = {
"max_new_tokens": 512,
"repetition_penalty": 1.1,
"do_sample": False,
}
generated_text = client.generate(prompt=SYSTEM_PROMPT + QUERY, **generation_args)
generated_text
✨ Key Features
- Multimodal processing: handles arbitrary interleaved image and text inputs and supports tasks such as image captioning and visual question answering.
- High-resolution image support: processes images at resolutions up to 980x980 without resizing them to fixed-size squares.
- Enhanced OCR: integrating OCR-related data significantly improves the ability to recognize and transcribe text in images and documents.
- Simplified visual feature integration: a new architecture simplifies how visual features are integrated into the language model.
- Multi-stage training: a two-stage training procedure improves efficiency and performance.
📦 Installation
No separate installation steps are documented; follow the environment setup in the Quick Start section above.
💻 Usage Examples
Basic Usage
The code examples in the Quick Start section above show how to generate text with Idefics2, covering both `idefics2-8b-base` and `idefics2-8b`.
Advanced Usage
For fine-tuning, the following resources are available (a rough sketch follows the list):
- Fine-tuning script based on the TRL library: Script
- Tutorial notebook for fine-tuning with the Hugging Face Trainer: Tutorial notebook
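As a rough, unofficial sketch (not the TRL script or the tutorial notebook above), fine-tuning with the Hugging Face Trainer can look roughly as follows; `train_dataset` and its `image`/`question`/`answer` fields, as well as all hyperparameters, are placeholders:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, Trainer, TrainingArguments

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)

def collate_fn(examples):
    # Pair each image with its question/answer via the chat template, then tokenize text and images together.
    texts, images = [], []
    for example in examples:
        messages = [
            {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]},
            {"role": "assistant", "content": [{"type": "text", "text": example["answer"]}]},
        ]
        texts.append(processor.apply_chat_template(messages, add_generation_prompt=False))
        images.append([example["image"]])
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch

training_args = TrainingArguments(
    output_dir="idefics2-finetuned",  # placeholder output directory
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,  # keep the raw image/question/answer columns for the collator
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: any dataset with image/question/answer fields
    data_collator=collate_fn,
)
trainer.train()
In practice, full fine-tuning of an 8B model requires substantial GPU memory, so parameter-efficient methods such as LoRA/QLoRA are commonly used instead.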
📚 Detailed Documentation
Model Overview
Attribute | Details |
---|---|
Developed by | Hugging Face |
Model type | Multimodal model (image + text) |
Language(s) | English |
License | Apache 2.0 |
Parent models | google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1 |
Resources for more information | Description of OBELICS: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents; Paper: What matters when building vision-language models? |
Intended Uses
`idefics2-8b-base` and `idefics2-8b` can be used for inference on multimodal (image + text) tasks such as image captioning and visual question answering. For a specific use case and data, fine-tuning `idefics2-8b` is recommended for best results. `idefics2-8b-chatty` is further fine-tuned for long conversations.
Technical Details
Idefics2 performs strongly for its size (8B parameters) compared with other open multimodal models and is often competitive with closed-source systems. It provides a solid foundation for fine-tuning on a wide range of use cases.
The results table below provides more details.
Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
---|---|---|---|---|---|---|---|---|---|
DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | |
Gemini 1.0 Pro | ❌ | 🤷♂️ | 🤷♂️ | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
Gemini 1.5 Pro | ❌ | 🤷♂️ | 🤷♂️ | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
Claude 3 Haiku | ❌ | 🤷♂️ | 🤷♂️ | 50.2/- | 46.4 | - | - | - | 88.8 |
Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
Idefics2 (w/o image splitting) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
Idefics2 (w/ image splitting) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 improves upon Idefics1 in several ways:
- High-resolution image handling: the NaViT strategy lets the model process images at their native resolution and aspect ratio, avoiding the traditional resize-to-square step. Following the SPHINX strategy, it also supports sub-image splitting for handling very high-resolution images.
- Enhanced OCR: integrating data for transcribing text in images and documents significantly improves OCR capability, and the model handles questions about charts, figures, and documents better.
- Simplified visual feature integration: a new architecture simplifies the integration of visual features into the language model and improves efficiency.
- Better performance: performance improves significantly over Idefics1 despite the model being 10x smaller.
Training Procedure
Idefics2 is trained in two stages:
- Stage 1: images are fed to the model at SigLIP's native resolution (squares of 384x384).
- Stage 2: images are fed at their native resolution (maximum 980, minimum 378) and native aspect ratio, and PDFA, Rendered-Text, and IDL data are added.
The model is then instruction fine-tuned on The Cauldron together with 9 text-only instruction fine-tuning datasets.
Model Optimizations
Half-Precision Loading
If your GPU supports it, we recommend loading and running the model in half precision (`torch.float16` or `torch.bfloat16`):
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
).to(DEVICE)
Vision Encoder Efficiency
If GPU memory is limited, you can take the following measures:
- Disable image splitting: pass `do_image_splitting=False` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
- Lower the maximum image resolution: pass `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", size={"longest_edge": 448, "shortest_edge": 378})
Speeding Up Generation with Flash Attention 2
First, make sure the `flash-attn` library is installed. Then pass `_attn_implementation="flash_attention_2"` when loading the model:
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
+ _attn_implementation="flash_attention_2",
).to(DEVICE)
4-bit Quantization
The model can be quantized to 4 bits with AWQ or `bitsandbytes`; see the full code examples in the original model card. A minimal `bitsandbytes` sketch follows.
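As a minimal illustration of the `bitsandbytes` path (the exact settings in the original examples may differ), loading in 4-bit NF4 could look like this, assuming `bitsandbytes` and `accelerate` are installed:
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit NF4 quantization with fp16 compute; these settings are illustrative defaults.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",  # quantized weights are dispatched to the GPU; do not call .to(DEVICE)
)
The table below summarizes how these options trade off memory and generation speed.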
Optimization Comparison
Flash Attention 2 | Image splitting | Float type | 4-bit quantization | Peak GPU memory (GB) | Time for 20 generations (s) |
---|---|---|---|---|---|
No | Yes | fp32 | No | 54.9 | 55.6 |
No | Yes | bf16 | No | 41.3 | 34.3 |
No | Yes | fp16 | No | 36.7 | 33.3 |
Yes | Yes | fp16 | No | 21.0 | 13.3 |
Yes | Yes | fp16 | bitsandbytes (entire model) | 8.9 | 19.9 |
No | Yes | fp16 | bitsandbytes (entire model) | 24.7 | 40.4 |
No | Yes | fp16 | AWQ (LLM only) | 26.4 | 37.1 |
Yes | Yes | fp16 | AWQ (LLM only) | 10.7 | 16.3 |
No | Yes | fp16 | AWQ + fusing (LLM only) | 26.0 | 38.4 |
No | No | fp32 | No | 38.8 | 17.5 |
No | No | bf16 | No | 22.2 | 14.4 |
No | No | fp16 | No | 21.3 | 13.9 |
Yes | No | fp16 | No | 18.1 | 10.4 |
Yes | No | fp16 | bitsandbytes (entire model) | 6.0 | 17.3 |
No | No | fp16 | bitsandbytes (entire model) | 9.2 | 20.9 |
No | No | fp16 | AWQ (LLM only) | 10.9 | 15.9 |
Yes | No | fp16 | AWQ (LLM only) | 7.8 | 12.3 |
No | No | fp16 | AWQ + fusing (LLM only) | 10.5 | 19.5 |
🔧 Technical Details
Model Architecture
Idefics2 is built on two pre-trained models, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, and uses a new architecture that simplifies the integration of visual features into the language model.
Training Data
Idefics2 was trained on the following data:
- HuggingFaceM4/OBELICS
- laion/laion-coco
- wikipedia
- facebook/pmd
- pixparse/idl-wds
- pixparse/pdfa-eng-wds
- wendlerc/RenderedText
- HuggingFaceM4/the_cauldron
- teknium/OpenHermes-2.5
- GAIR/lima
- databricks/databricks-dolly-15k
- meta-math/MetaMathQA
- TIGER-Lab/MathInstruct
- microsoft/orca-math-word-problems-200k
- camel-ai/math
- AtlasUnified/atlas-math-sets
- tiedong/goat
- Lin-Chen/ShareGPT4V
- jxu124/llava_conversation_58k
Training Procedure
Idefics2 is trained in two stages, as described in the Detailed Documentation section above.
📄 License
Idefics2 is released under the Apache 2.0 license; the two pre-trained models it builds on, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, are also released under Apache 2.0.
⚠️ Important Notice
💡 Usage Tips
- For best results, fine-tune `idefics2-8b` on your specific use case and data.
- If your GPU supports it, load and run the model in half precision for better efficiency.
- Be aware of the model's potential biases and limitations, and avoid using it in high-risk settings.
📖 Citation
If you use Idefics2, please cite:
@misc{laurencon2023obelics,
title={OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
author={Hugo Laurençon and Lucile Saulnier and Léo Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M. Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
year={2023},
eprint={2306.16527},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
@misc{laurençon2024matters,
title={What matters when building vision-language models?},
author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
year={2024},
eprint={2405.02246},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
🙏 Acknowledgements
Thanks to @yjernite, @sasha, @meg, @giadap, @jack-kumar, and @frimelle for their help with red-teaming the model.