🚀 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. Compared with Idefics1, it is significantly better at optical character recognition (OCR), document understanding, and visual reasoning.
🚀 Quick Start
Environment Setup
Before using Idefics2, install the required libraries:
pip install transformers requests torch pillow
Code Examples
Below are text-generation examples using `idefics2-8b-base` and `idefics2-8b`:
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda:0"
# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
Example with `idefics2-8b-base`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)
# Create inputs
prompts = [
"<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
"In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of Chicago, and more specifically the skyscrapers of the city.', 'In which city is that bridge located? The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and']
Example with `idefics2-8b` and `idefics2-8b-chatty`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
).to(DEVICE)
# Create inputs
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What do we see in this image?"},
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
]
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "And how about this image?"},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']
Text Generation Inference
Idefics2 is integrated into TGI, and API endpoints are available for `idefics2-8b` and `idefics2-8b-chatty`.
from text_generation import Client
API_TOKEN="<YOUR_API_TOKEN>"
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics2-8b-chatty"
# System prompt used in the playground for `idefics2-8b-chatty`
SYSTEM_PROMPT = "System: The following is a conversation between Idefics2, a highly knowledgeable and intelligent visual AI assistant created by Hugging Face, referred to as Assistant, and a human user called User. In the following interactions, User and Assistant will converse in natural language, and Assistant will do its best to answer User’s questions. Assistant has the ability to perceive images and reason about them, but it cannot generate images. Assistant was built to be respectful, polite and inclusive. It knows a lot, and always tells the truth. When prompted with an image, it does not make up facts.<end_of_utterance>\nAssistant: Hello, I'm Idefics2, Huggingface's latest multimodal assistant. How can I help you?<end_of_utterance>\n"
# Images are passed to TGI inside the prompt using markdown image syntax; <image_url> is a placeholder
QUERY = "User:![](<image_url>)Describe this image.<end_of_utterance>\nAssistant:"
client = Client(
base_url=API_URL,
headers={"x-use-cache": "0", "Authorization": f"Bearer {API_TOKEN}"},
)
generation_args = {
"max_new_tokens": 512,
"repetition_penalty": 1.1,
"do_sample": False,
}
generated_text = client.generate(prompt=SYSTEM_PROMPT + QUERY, **generation_args)
generated_text
✨ Key Features
- Multimodal processing: handles arbitrary interleaved image and text inputs and supports tasks such as image captioning and visual question answering.
- High-resolution image support: processes images at resolutions up to 980x980 without resizing them to fixed-size squares.
- Enhanced OCR: integrating OCR-related data significantly improves the ability to recognize and transcribe text in images and documents.
- Simplified visual feature integration: a new architecture simplifies how visual features are integrated into the language model.
- Multi-stage training: a two-stage training procedure improves efficiency and performance.
📦 Installation
No separate installation steps are documented; follow the environment setup in the Quick Start section above.
💻 Usage Examples
Basic Usage
The code examples in the Quick Start section above show how to generate text with Idefics2, covering both `idefics2-8b-base` and `idefics2-8b`.
Advanced Usage
For fine-tuning, the following resources are available (a rough sketch follows the list):
- Fine-tuning script based on the TRL library: Script
- Tutorial notebook for fine-tuning with the Hugging Face Trainer: Tutorial notebook
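As a rough, unofficial sketch (not the TRL script or the tutorial notebook above), fine-tuning with the Hugging Face Trainer can look roughly as follows; `train_dataset` and its `image`/`question`/`answer` fields, as well as all hyperparameters, are placeholders:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, Trainer, TrainingArguments

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)

def collate_fn(examples):
    # Pair each image with its question/answer via the chat template, then tokenize text and images together.
    texts, images = [], []
    for example in examples:
        messages = [
            {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]},
            {"role": "assistant", "content": [{"type": "text", "text": example["answer"]}]},
        ]
        texts.append(processor.apply_chat_template(messages, add_generation_prompt=False))
        images.append([example["image"]])
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch

training_args = TrainingArguments(
    output_dir="idefics2-finetuned",  # placeholder output directory
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,  # keep the raw image/question/answer columns for the collator
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: any dataset with image/question/answer fields
    data_collator=collate_fn,
)
trainer.train()
In practice, full fine-tuning of an 8B model requires substantial GPU memory, so parameter-efficient methods such as LoRA/QLoRA are commonly used instead.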
📚 Detailed Documentation
Model Overview
Attribute | Details |
---|---|
Developed by | Hugging Face |
Model type | Multimodal model (image + text) |
Language(s) | English |
License | Apache 2.0 |
Parent models | google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1 |
Resources for more information | Description of OBELICS: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents; Paper: What matters when building vision-language models? |
Intended Uses
`idefics2-8b-base` and `idefics2-8b` can be used for inference on multimodal (image + text) tasks such as image captioning and visual question answering. For a specific use case and data, fine-tuning `idefics2-8b` is recommended for best results. `idefics2-8b-chatty` is further fine-tuned for long conversations.
Technical Details
Idefics2 performs strongly for its size (8B parameters) compared with other open multimodal models and is often competitive with closed-source systems. It provides a solid foundation for fine-tuning on a wide range of use cases.
The results table below provides more details.
Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
---|---|---|---|---|---|---|---|---|---|
DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | |
Gemini 1.0 Pro | ❌ | 🤷♂️ | 🤷♂️ | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
Gemini 1.5 Pro | ❌ | 🤷♂️ | 🤷♂️ | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
Claude 3 Haiku | ❌ | 🤷♂️ | 🤷♂️ | 50.2/- | 46.4 | - | - | - | 88.8 |
Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
Idefics2 (w/o image splitting) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
Idefics2 (w/ image splitting) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 improves upon Idefics1 in several ways:
- High-resolution image handling: the NaViT strategy lets the model process images at their native resolution and aspect ratio, avoiding the traditional resize-to-square step. Following the SPHINX strategy, it also supports sub-image splitting for handling very high-resolution images.
- Enhanced OCR: integrating data for transcribing text in images and documents significantly improves OCR capability, and the model handles questions about charts, figures, and documents better.
- Simplified visual feature integration: a new architecture simplifies the integration of visual features into the language model and improves efficiency.
- Better performance: performance improves significantly over Idefics1 despite the model being 10x smaller.
Training Procedure
Idefics2 is trained in two stages:
- Stage 1: images are fed to the model at SigLIP's native resolution (squares of 384x384).
- Stage 2: images are fed at their native resolution (maximum 980, minimum 378) and native aspect ratio, and PDFA, Rendered-Text, and IDL data are added.
The model is then instruction fine-tuned on The Cauldron together with 9 text-only instruction fine-tuning datasets.
Model Optimizations
Half-Precision Loading
If your GPU supports it, we recommend loading and running the model in half precision (`torch.float16` or `torch.bfloat16`):
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
).to(DEVICE)
Vision Encoder Efficiency
If GPU memory is limited, you can take the following measures:
- Disable image splitting: pass `do_image_splitting=False` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
- Lower the maximum image resolution: pass `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", size={"longest_edge": 448, "shortest_edge": 378})
Speeding Up Generation with Flash Attention 2
First, make sure the `flash-attn` library is installed. Then pass `_attn_implementation="flash_attention_2"` when loading the model:
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
+ _attn_implementation="flash_attention_2",
).to(DEVICE)
4-bit Quantization
The model can be quantized to 4 bits with AWQ or `bitsandbytes`; see the full code examples in the original model card. A minimal `bitsandbytes` sketch follows.
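As a minimal illustration of the `bitsandbytes` path (the exact settings in the original examples may differ), loading in 4-bit NF4 could look like this, assuming `bitsandbytes` and `accelerate` are installed:
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit NF4 quantization with fp16 compute; these settings are illustrative defaults.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",  # quantized weights are dispatched to the GPU; do not call .to(DEVICE)
)
The table below summarizes how these options trade off memory and generation speed.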
Optimization Comparison
Flash Attention 2 | Image splitting | Float type | 4-bit quantization | Peak GPU memory (GB) | Time for 20 generations (s) |
---|---|---|---|---|---|
No | Yes | fp32 | No | 54.9 | 55.6 |
No | Yes | bf16 | No | 41.3 | 34.3 |
No | Yes | fp16 | No | 36.7 | 33.3 |
Yes | Yes | fp16 | No | 21.0 | 13.3 |
Yes | Yes | fp16 | bitsandbytes (entire model) | 8.9 | 19.9 |
No | Yes | fp16 | bitsandbytes (entire model) | 24.7 | 40.4 |
No | Yes | fp16 | AWQ (LLM only) | 26.4 | 37.1 |
Yes | Yes | fp16 | AWQ (LLM only) | 10.7 | 16.3 |
No | Yes | fp16 | AWQ + fusing (LLM only) | 26.0 | 38.4 |
No | No | fp32 | No | 38.8 | 17.5 |
No | No | bf16 | No | 22.2 | 14.4 |
No | No | fp16 | No | 21.3 | 13.9 |
Yes | No | fp16 | No | 18.1 | 10.4 |
Yes | No | fp16 | bitsandbytes (entire model) | 6.0 | 17.3 |
No | No | fp16 | bitsandbytes (entire model) | 9.2 | 20.9 |
No | No | fp16 | AWQ (LLM only) | 10.9 | 15.9 |
Yes | No | fp16 | AWQ (LLM only) | 7.8 | 12.3 |
No | No | fp16 | AWQ + fusing (LLM only) | 10.5 | 19.5 |
🔧 Technical Details
Model Architecture
Idefics2 is built on two pre-trained models, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, and uses a new architecture that simplifies the integration of visual features into the language model.
Training Data
Idefics2 was trained on the following data:
- HuggingFaceM4/OBELICS
- laion/laion-coco
- wikipedia
- facebook/pmd
- pixparse/idl-wds
- pixparse/pdfa-eng-wds
- wendlerc/RenderedText
- HuggingFaceM4/the_cauldron
- teknium/OpenHermes-2.5
- GAIR/lima
- databricks/databricks-dolly-15k
- meta-math/MetaMathQA
- TIGER-Lab/MathInstruct
- microsoft/orca-math-word-problems-200k
- camel-ai/math
- AtlasUnified/atlas-math-sets
- tiedong/goat
- Lin-Chen/ShareGPT4V
- jxu124/llava_conversation_58k
Training Procedure
Idefics2 is trained in two stages, as described in the Detailed Documentation section above.
📄 License
Idefics2 is released under the Apache 2.0 license; the two pre-trained models it builds on, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, are also released under Apache 2.0.
⚠️ Important Notice
💡 Usage Tips
- For best results, fine-tune `idefics2-8b` on your specific use case and data.
- If your GPU supports it, load and run the model in half precision for better efficiency.
- Be aware of the model's potential biases and limitations, and avoid using it in high-risk settings.
📖 Citation
If you use Idefics2, please cite:
@misc{laurencon2023obelics,
title={OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
author={Hugo Laurençon and Lucile Saulnier and Léo Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M. Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
year={2023},
eprint={2306.16527},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
@misc{laurençon2024matters,
title={What matters when building vision-language models?},
author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
year={2024},
eprint={2405.02246},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
🙏 Acknowledgements
Thanks to @yjernite, @sasha, @meg, @giadap, @jack-kumar, and @frimelle for their help with red-teaming the model.