license: apache-2.0
language:
- en
- zh
pipeline_tag: visual-question-answering
tags:
- multimodal
library_name: transformers
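Introduction
Xinyuan-VL-2B is a high-performance multimodal large model for on-device use, released by the Cylingo Group. It is fine-tuned from Qwen/Qwen2-VL-2B-Instruct
on more than 5 million multimodal samples plus a small amount of plain-text data, and it performs well on several authoritative benchmarks.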
How to use
To build on the thriving open-source ecosystem, we chose to fine-tune Qwen/Qwen2-VL-2B-Instruct, which produced Cylingo/Xinyuan-VL-2B.
Usage is therefore identical to that of Qwen/Qwen2-VL-2B-Instruct:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Cylingo/Xinyuan-VL-2B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Cylingo/Xinyuan-VL-2B")

# A single user turn containing one image and one text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template to a prompt string and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on.
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so that only the new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
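The slicing step near the end of the snippet is easy to overlook: for a decoder-only model, `generate` returns each sequence as the prompt tokens followed by the newly generated tokens, so the prompt prefix has to be dropped before decoding. The sketch below isolates that step with plain Python lists. The helper name `trim_prompt_tokens` and the token IDs are made up for illustration and are not real tokenizer output.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    Both arguments are batches (lists) of token-ID sequences, and each
    generated sequence is assumed to start with its own prompt.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy batch of two prompts with illustrative token IDs.
prompt_batch = [[101, 7, 8], [101, 9]]
# What a generate() call would return: prompt followed by new tokens.
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 55]]

new_tokens = trim_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43], [55]]
```

Without this step, the decoded output would repeat the full chat prompt ahead of the model's answer.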
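Evaluation
We evaluated Xinyuan-VL-2B across multiple benchmarks with the VLMEvalKit toolkit. The results show that Xinyuan-VL-2B outperforms Qwen/Qwen2-VL-2B-Instruct, open-sourced by Alibaba Cloud, and other influential open-source models of the same size on many of these benchmarks.
Full results are available at opencompass/open_vlm_leaderboard: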
| Benchmark | MiniCPM-2B | InternVL-2B | Qwen2-VL-2B | Xinyuan-VL-2B |
| --- | --- | --- | --- | --- |
| MMB-CN-V11-Test | 64.5 | 68.9 | 71.2 | 74.3 |
| MMB-EN-V11-Test | 65.8 | 70.2 | 73.2 | 76.5 |
| MMB-EN | 69.1 | 74.4 | 74.3 | 78.9 |
| MMB-CN | 66.5 | 71.2 | 73.8 | 76.12 |
| CCBench | 45.3 | 74.7 | 53.7 | 55.5 |
| MMT-Bench | 53.5 | 50.8 | 54.5 | 55.2 |
| RealWorldQA | 55.8 | 57.3 | 62.9 | 63.9 |
| SEEDBench_IMG | 67.1 | 70.9 | 72.86 | 73.4 |
| AI2D | 56.3 | 74.1 | 74.7 | 74.2 |
| MMMU | 38.2 | 36.3 | 41.1 | 40.9 |
| HallusionBench | 36.2 | 36.2 | 42.4 | 55.00 |
| POPE | 86.3 | 86.3 | 86.82 | 89.42 |
| MME | 1808.6 | 1876.8 | 1872.0 | 1854.9 |
| MMStar | 39.1 | 49.8 | 47.5 | 51.87 |
| SEEDBench2_Plus | 51.9 | 59.9 | 62.23 | 62.98 |
| BLINK | 41.2 | 42.8 | 43.92 | 42.98 |
| OCRBench | 605 | 781 | 794 | 782 |
| TextVQA | 74.1 | 73.4 | 79.7 | 77.6 |
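
As a quick check of the comparison with the base model, the two right-hand columns of the table above can be tallied with a few lines of Python. This is a minimal sketch: the scores are copied from the table, the benchmark names are written in their usual English form, and it assumes that a higher score is better on every benchmark listed.

```python
# (benchmark, Qwen2-VL-2B score, Xinyuan-VL-2B score), copied from the table.
scores = [
    ("MMB-CN-V11-Test", 71.2, 74.3),
    ("MMB-EN-V11-Test", 73.2, 76.5),
    ("MMB-EN", 74.3, 78.9),
    ("MMB-CN", 73.8, 76.12),
    ("CCBench", 53.7, 55.5),
    ("MMT-Bench", 54.5, 55.2),
    ("RealWorldQA", 62.9, 63.9),
    ("SEEDBench_IMG", 72.86, 73.4),
    ("AI2D", 74.7, 74.2),
    ("MMMU", 41.1, 40.9),
    ("HallusionBench", 42.4, 55.00),
    ("POPE", 86.82, 89.42),
    ("MME", 1872.0, 1854.9),
    ("MMStar", 47.5, 51.87),
    ("SEEDBench2_Plus", 62.23, 62.98),
    ("BLINK", 43.92, 42.98),
    ("OCRBench", 794, 782),
    ("TextVQA", 79.7, 77.6),
]

# Assumption: higher is better for every benchmark in the list.
wins = [name for name, base, tuned in scores if tuned > base]
losses = [name for name, base, tuned in scores if tuned < base]

print(f"Xinyuan-VL-2B scores higher on {len(wins)} of {len(scores)} benchmarks")
print("Base model scores higher on:", ", ".join(losses))
```

By this count the fine-tuned model is ahead on 12 of the 18 benchmarks, while the base model keeps a lead on AI2D, MMMU, MME, BLINK, OCRBench, and TextVQA.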