granite-vision-3.2-2b: Open-Source Vision-Language Model for Efficient Extraction of Tables, Charts, and Other Document Content

Granite Vision 3.2 2b

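Developed by unsloth

granite-vision-3.2-2b is a compact and efficient vision-language model designed for visual document understanding. It automatically extracts content from tables, charts, infographics, and similar sources.

Image-to-Text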

Transformers

Language: English | License: Apache-2.0 | #DocumentVisualUnderstanding #ChartDataExtraction #EfficientOCR

Downloads: 43

Release date: 3/14/2025

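Model Overview

The model was trained on a carefully curated instruction-following dataset comprising diverse public datasets and synthetic datasets tailored to a wide range of document understanding and general image tasks. It was trained by fine-tuning a Granite large language model on both image and text modalities.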

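Model Features

Efficient visual document understanding

Automatically extracts content from tables, charts, infographics, plots, diagrams, and more

Multimodal capability

Processes visual and text data together, which suits a wide range of business scenarios

High performance

Scores higher than comparable 1B-4B models on most of the document understanding benchmarks listed in the evaluation results

Lightweight design

Only 2B parameters, delivering strong performance while remaining efficient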

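Model Capabilities

Table analysis

Chart understanding

Infographic parsing

Optical character recognition (OCR)

Document question answering

General image understanding

Visual question answering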

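Use Cases

Document processing

Document question answering

Answers questions based on document content

Scores 0.89 on the DocVQA benchmark

Chart analysis

Extracts and analyzes data from charts

Scores 0.87 on the ChartQA benchmark

General visual understanding

Visual question answering

Answers questions about image content

Scores 0.78 on the VQAv2 benchmark

Real-world scene understanding

Understands the content of real-world images

Scores 0.63 on the RealWorldQA benchmark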

🚀 granite-vision-3.2-2b

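granite-vision-3.2-2b is a compact and efficient vision-language model designed for visual document understanding. It automatically extracts content from tables, charts, infographics, plots, diagrams, and more. The model was trained on a carefully curated instruction-following dataset that covers a variety of public and synthetic datasets, supporting a wide range of document understanding and general image tasks.

📦 Installation

Using the `transformers` library

First, make sure a recent version of the transformers library (4.49 or later) is installed: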

pip install "transformers>=4.49"
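
To confirm that the installed version satisfies the 4.49 minimum, the version string can be compared from Python. The following is a minimal sketch that uses only the standard library. The helper names `numeric_prefix` and `meets_minimum` are hypothetical, and the comparison looks only at the leading numeric components, so suffixes such as `.dev0` are ignored.

```python
import re
from importlib import metadata


def numeric_prefix(version: str) -> tuple:
    """Turn '4.49.0.dev0' into (4, 49, 0), keeping only the leading numeric parts."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum` (numeric parts only)."""
    a, b = numeric_prefix(installed), numeric_prefix(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that "4.49" compares equal to "4.49.0".
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed, "4.49") else "too old"
        print(f"transformers {installed}: {status}")
```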

Using the `vLLM` library

To load the model with vLLM, first install the following libraries:

pip install torch torchvision torchaudio
pip install vllm==0.6.6

💻 Usage Examples

Basic usage (with the `transformers` library)

from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.2-2b"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# prepare image and text prompt, using the appropriate prompt template

img_path = hf_hub_download(repo_id=model_path, filename='example.png')

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_path},
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)


# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
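
When several questions are asked about the same image, the conversation structure used above can be wrapped in a small helper. This is a minimal sketch: `build_conversation` and `strip_prompt` are hypothetical helper names, not part of transformers. `strip_prompt` assumes that `generate` returns the prompt tokens followed by the newly generated ones, which is the usual behavior for decoder-only models but is worth checking for your version.

```python
def build_conversation(image_path: str, question: str) -> list:
    """Build a single-turn conversation with one image and one text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def strip_prompt(output_ids, prompt_length: int):
    """Drop the first `prompt_length` ids so that only generated ids remain."""
    return output_ids[prompt_length:]


# Intended use with the objects from the example above (not executed here):
#   conversation = build_conversation(img_path, "What is the title of the chart?")
#   inputs = processor.apply_chat_template(
#       conversation, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt",
#   ).to(device)
#   output = model.generate(**inputs, max_new_tokens=100)
#   answer_ids = strip_prompt(output[0], inputs["input_ids"].shape[1])
#   print(processor.decode(answer_ids, skip_special_tokens=True))
```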

Advanced usage (with the `vLLM` library)

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.2-2b"

model = LLM(
    model=model_path,
    limit_mm_per_prompt={"image": 1},
)

sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)

# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"

question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
image = Image.open(img_path).convert("RGB")
print(image)

# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    }
}

outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")
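
The prompt template in the example above is easy to get wrong when typed by hand, so it can help to build it in one place. The sketch below reproduces the same template and prepares one input dictionary per question. `build_prompt` and `build_inputs` are hypothetical helper names. Passing a list of such dictionaries to `generate` is assumed to work the same way as passing a single one, so verify this against the vLLM version you have installed.

```python
SYSTEM_PROMPT = (
    "<|system|>\n"
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
)
IMAGE_TOKEN = "<image>"


def build_prompt(question: str) -> str:
    """Format one question with the system prompt and the image placeholder."""
    return f"{SYSTEM_PROMPT}<|user|>\n{IMAGE_TOKEN}\n{question}\n<|assistant|>\n"


def build_inputs(questions, image) -> list:
    """Create one vLLM-style input dictionary per question for the same image."""
    return [
        {"prompt": build_prompt(q), "multi_modal_data": {"image": image}}
        for q in questions
    ]


# Intended use with the objects from the example above (not executed here):
#   batch = build_inputs(["What is the title?", "Which model scores highest?"], image)
#   outputs = model.generate(batch, sampling_params=sampling_params)
#   for out in outputs:
#       print(out.outputs[0].text)
```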

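📚 Detailed Documentation

Model fine-tuning

For an example of fine-tuning Granite Vision for new tasks, see this notebook.

MM RAG with Granite Vision

For an example of multimodal retrieval-augmented generation (MM RAG) with Granite Vision, see this notebook.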

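🔧 Technical Details

Model architecture

The architecture of granite-vision-3.2-2b consists of the following components: (1) Vision encoder: SigLIP (https://huggingface.co/docs/transformers/en/model_doc/siglip). (2) Vision-language connector: a two-layer multilayer perceptron (MLP) with GELU activation. (3) Large language model: granite-3.1-2b-instruct (https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) with a 128k context length.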
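
To illustrate the connector in component (2), the sketch below projects one visual feature vector into the language model's embedding space with two linear layers and a GELU in between. It is a toy in pure Python, not the model's implementation: the dimensions, the random weights, and the choice of making the hidden width equal to the text width are all assumptions made for illustration.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def make_layer(in_dim: int, out_dim: int, rng: random.Random):
    """Randomly initialized (weights, bias) pair; a stand-in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def connector(visual_feature, layer1, layer2):
    """Two-layer MLP: linear, GELU, linear."""
    hidden = [gelu(h) for h in linear(visual_feature, *layer1)]
    return linear(hidden, *layer2)


VISION_DIM, TEXT_DIM = 8, 4  # toy sizes, far smaller than any real model
rng = random.Random(0)
layer1 = make_layer(VISION_DIM, TEXT_DIM, rng)
layer2 = make_layer(TEXT_DIM, TEXT_DIM, rng)

patch_feature = [rng.uniform(-1.0, 1.0) for _ in range(VISION_DIM)]
projected = connector(patch_feature, layer1, layer2)
print(len(projected))  # one value per text-embedding dimension
```

In the real model, each image patch feature produced by the vision encoder would pass through such a projection before being placed in the language model's input sequence.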

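We built on LLaVA (https://llava-vl.github.io) to train the model. Multi-layer encoder features and a denser grid resolution in AnyRes enhance the model's ability to understand fine-grained visual content, which is essential for accurately interpreting document images.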
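
To make the AnyRes idea concrete: an image is matched to one of several candidate grid resolutions, resized to fit, and split into tiles that the vision encoder processes separately. The sketch below shows only the grid-selection step, written in the spirit of the LLaVA-NeXT reference logic. The candidate resolutions and the 384-pixel tile size are illustrative assumptions, not the values in this model's configuration.

```python
def select_best_resolution(original_size, candidates):
    """Pick the candidate (width, height) that keeps the most image detail.

    Ties on preserved detail are broken by the least padded (wasted) area.
    """
    orig_w, orig_h = original_size
    best, best_effective, best_wasted = None, 0, float("inf")
    for width, height in candidates:
        # Scale the image to fit inside the candidate while keeping its aspect ratio.
        scale = min(width / orig_w, height / orig_h)
        scaled_w, scaled_h = int(orig_w * scale), int(orig_h * scale)
        # Upscaling adds no information, so cap at the original pixel count.
        effective = min(scaled_w * scaled_h, orig_w * orig_h)
        wasted = width * height - effective
        if effective > best_effective or (
            effective == best_effective and wasted < best_wasted
        ):
            best, best_effective, best_wasted = (width, height), effective, wasted
    return best


def tile_grid(resolution, tile=384):
    """Number of (columns, rows) of square tiles covering the chosen resolution."""
    width, height = resolution
    return width // tile, height // tile


# Hypothetical candidate grids built from 384-pixel tiles.
CANDIDATES = [(384, 384), (768, 384), (384, 768), (768, 768)]

chosen = select_best_resolution((1536, 768), CANDIDATES)
print(chosen, tile_grid(chosen))
```

A denser set of candidate grids gives wide or tall document pages a closer match, so less of each tile is spent on padding.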

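Training data

Overall, our training data comes largely from two key sources: (1) publicly available datasets and (2) synthetic data created internally to target specific capabilities, including document understanding tasks. Detailed attribution of the datasets can be found in the technical report.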

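Infrastructure

We train Granite Vision on Blue Vela, IBM's supercomputing cluster, which is equipped with NVIDIA H100 GPUs. The cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.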

📄 License

This model is released under the Apache 2.0 license.

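⚠️ Important Note

The use of large vision and language models involves risks and ethical considerations, including but not limited to bias and fairness, misinformation, and autonomous decision-making. granite-vision-3.2-2b is no exception. Although our alignment process includes safety considerations, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts.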

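💡 Usage Recommendations

To enhance safety, we recommend using granite-vision-3.2-2b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated data and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety. We recommend using this model for document understanding tasks, and note that more general vision tasks may carry a higher inherent risk of triggering biased or harmful output.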

📋 Model Information

Property	Details
Model type	granite-vision-3.2-2b
Base model	ibm-granite/granite-vision-3.2-2b
Training data	Public datasets and internally created synthetic data
Library name	transformers
License	Apache 2.0

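📈 Evaluation Results

We evaluated Granite Vision 3.2 alongside other vision-language models (VLMs) in the 1B-4B parameter range using the standard llms-eval benchmark. The evaluation spans multiple public benchmarks, with particular emphasis on document understanding tasks, and also includes general visual question-answering benchmarks.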

	Molmo-E	InternVL2	Phi3v	Phi3.5v	Granite Vision
Document benchmarks
DocVQA	0.66	0.87	0.87	0.88	0.89
ChartQA	0.60	0.75	0.81	0.82	0.87
TextVQA	0.62	0.72	0.69	0.70	0.78
AI2D	0.63	0.74	0.79	0.79	0.76
InfoVQA	0.44	0.58	0.55	0.61	0.64
OCRBench	0.65	0.75	0.64	0.64	0.77
LiveXiv VQA	0.47	0.51	0.61	-	0.61
LiveXiv TQA	0.36	0.38	0.48	-	0.57
Other benchmarks
MMMU	0.32	0.35	0.42	0.44	0.37
VQAv2	0.57	0.75	0.76	0.77	0.78
RealWorldQA	0.55	0.34	0.60	0.58	0.63
VizWiz VQA	0.49	0.46	0.57	0.57	0.63
OK VQA	0.40	0.44	0.51	0.53	0.56
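
The table can be summarized programmatically, for example to list which model leads each benchmark. The sketch below copies the scores from the table, using `None` for the entries marked with a dash. The function name `leaders` is a hypothetical helper.

```python
MODELS = ["Molmo-E", "InternVL2", "Phi3v", "Phi3.5v", "Granite Vision"]

# Scores copied from the table above; None marks a missing entry ("-").
SCORES = {
    "DocVQA": [0.66, 0.87, 0.87, 0.88, 0.89],
    "ChartQA": [0.60, 0.75, 0.81, 0.82, 0.87],
    "TextVQA": [0.62, 0.72, 0.69, 0.70, 0.78],
    "AI2D": [0.63, 0.74, 0.79, 0.79, 0.76],
    "InfoVQA": [0.44, 0.58, 0.55, 0.61, 0.64],
    "OCRBench": [0.65, 0.75, 0.64, 0.64, 0.77],
    "LiveXiv VQA": [0.47, 0.51, 0.61, None, 0.61],
    "LiveXiv TQA": [0.36, 0.38, 0.48, None, 0.57],
    "MMMU": [0.32, 0.35, 0.42, 0.44, 0.37],
    "VQAv2": [0.57, 0.75, 0.76, 0.77, 0.78],
    "RealWorldQA": [0.55, 0.34, 0.60, 0.58, 0.63],
    "VizWiz VQA": [0.49, 0.46, 0.57, 0.57, 0.63],
    "OK VQA": [0.40, 0.44, 0.51, 0.53, 0.56],
}


def leaders(scores: dict) -> dict:
    """Map each benchmark to the list of models sharing the top score."""
    result = {}
    for benchmark, row in scores.items():
        best = max(value for value in row if value is not None)
        result[benchmark] = [m for m, value in zip(MODELS, row) if value == best]
    return result


top = leaders(SCORES)
led = [name for name, models in top.items() if "Granite Vision" in models]
not_led = [name for name in SCORES if name not in led]
print(f"Granite Vision is top or tied on {len(led)} of {len(SCORES)} benchmarks")
print("Behind on:", ", ".join(not_led))
```

On these numbers, Granite Vision is top or tied on 11 of the 13 benchmarks and trails only on AI2D and MMMU.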