Qwen2.5-VL-7B-Instruct-GGUF: Open-Source Vision-Language Model for Free Image and Video Analysis with Structured Output

Qwen2.5 VL 7B Instruct GGUF

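Developed by unsloth

Qwen2.5-VL is the latest vision-language model in the Qwen family. It offers strong visual understanding and multimodal processing, and supports image analysis, video analysis, and structured output.

Image-to-Text | English | License: Apache-2.0 | #MultimodalAgent #LongVideoUnderstanding #StructuredDataExtraction

Downloads: 8,427

Released: 5/11/2025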

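Model Overview

Qwen2.5-VL is a multimodal vision-language model focused on improved visual understanding, agent capabilities, and structured output. It suits scenarios such as finance and commerce.

Model Features

Enhanced visual understanding

Accurately recognizes objects, text, charts, icons, and page layouts, and supports analysis of complex visual content.

Agent capabilities

Can run directly as a visual agent that calls tools dynamically, including computer-use and phone-use scenarios.

Long video understanding

Can parse videos longer than one hour and capture events by pinpointing the relevant segments.

Structured output

Produces structured output for data such as invoices and tables, which suits professional settings such as finance and commerce.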

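Model Capabilities

Image analysis

Video understanding

Text recognition

Chart parsing

Visual grounding

Structured data extraction

Multimodal reasoning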

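Use Cases

Business analytics

Invoice processing

Automatically extracts structured data from invoices

95.7% accuracy on the DocVQA test set

Education

Chart understanding

Parses chart information in teaching materials

87.3% accuracy on the ChartQA test set

Intelligent assistants

Visual agent

Performs screen-operation tasks as an agent

Score of 84.7 on the ScreenSpot test set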

🚀 Qwen2.5-VL-7B-Instruct

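Qwen2.5-VL-7B-Instruct is the latest vision-language model in the Qwen series. It offers strong visual understanding, analysis, and reasoning, handles multimodal data such as images and video, and suits domains such as finance and commerce.

🚀 Quick Start

Install dependencies

The Qwen2.5-VL code is integrated into the latest Hugging Face Transformers library. Installing from source with the following command is recommended: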

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you may encounter the following error:

KeyError: 'qwen2_5_vl'

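A companion toolkit makes it easier to handle various kinds of visual input, including base64, URLs, and interleaved images and videos:

# Using the `[decord]` extra is strongly recommended for faster video loading
pip install "qwen-vl-utils[decord]==0.0.8"

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.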
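
Before loading any video, it can help to check which backend is actually installed. The snippet below is a minimal sketch using only the standard library. `available_video_backend` is a hypothetical helper, not part of qwen-vl-utils, and it only mirrors the preference order described above (decord first, torchvision as the fallback).

```python
import importlib.util


def available_video_backend():
    """Return the name of the first installed video backend, or None.

    Hypothetical helper: it mirrors the documented preference
    (decord first, torchvision as the fallback) but is not library code.
    """
    for name in ("decord", "torchvision"):
        # find_spec returns None when the package is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    backend = available_video_backend()
    print(f"video backend that would be available: {backend}")
```

A result of `None` means neither package is importable in the current environment.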

Chat with 🤗 Transformers

The following code example calls the chat model using transformers and qwen_vl_utils:

import torch  # needed only for the optional flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings,
# especially in multi-image and video scenarios
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384.
# Set min_pixels and max_pixels as needed, e.g. a token range of 256 to 1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
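
The `min_pixels` and `max_pixels` comments above imply that one visual token corresponds to a 28×28 pixel area, so a range of 256 to 1280 tokens maps to `256*28*28` to `1280*28*28` pixels. The sketch below estimates the token count for a given image size. `pixel_budget` and `estimate_visual_tokens` are hypothetical helpers; the resizing shown (scale the area into the budget, snap both sides to multiples of 28) is an approximation for illustration, and the actual processor may round differently.

```python
import math

PIXELS_PER_TOKEN = 28 * 28  # taken from min_pixels = 256*28*28 for 256 tokens


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def estimate_visual_tokens(height, width, min_pixels, max_pixels, factor=28):
    """Approximate the visual token count after resizing into the pixel budget.

    Assumption: sides are snapped to multiples of `factor` and the area is
    scaled to fit within [min_pixels, max_pixels], keeping the aspect ratio.
    """
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink, rounding down so the budget is not exceeded
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge, rounding up so the minimum is reached
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return (h_bar // factor) * (w_bar // factor)


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)
print(estimate_visual_tokens(1080, 1920, min_pixels, max_pixels))
```

A tighter `max_pixels` lowers the token count per image, which reduces memory use and latency at the cost of visual detail.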

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
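
When the image list is built at runtime, the message structure above can be generated rather than written by hand. The following is a minimal sketch; `build_image_message` is a hypothetical helper (not part of transformers or qwen_vl_utils) that produces the same `file://` style entries as the example above.

```python
from pathlib import Path


def build_image_message(image_paths, prompt):
    """Build a single-turn user message with several images and one text query."""
    content = [
        # as_uri() yields a file:// URI like the ones in the example above
        {"type": "image", "image": Path(p).resolve().as_uri()}
        for p in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_image_message(
    ["image1.jpg", "image2.jpg"],
    "Identify the similarities between these images.",
)
print(messages)
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the hand-written `messages`.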

Video inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also fed to the model so it can align with absolute time.
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
# video_kwargs already carries the fps information, so no separate fps argument is passed
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video URL compatibility depends largely on the version of the third-party library; see the table below. To override the default backend, set FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
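
The compatibility table can be encoded as a small lookup that picks a backend for a given URL. This is a sketch under stated assumptions: `pick_video_reader` is a hypothetical helper, the torchvision version is passed in as a tuple rather than detected, and decord is assumed to be installed. Only the environment variable name and its two values come from the text above.

```python
import os
from urllib.parse import urlparse


def pick_video_reader(url, torchvision_version):
    """Return a backend name able to read `url`, or None if none can.

    `torchvision_version` is a (major, minor, patch) tuple, or None when
    torchvision is not installed. decord is assumed to be installed.
    """
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        return None  # the table only covers HTTP and HTTPS
    if torchvision_version is not None and tuple(torchvision_version) >= (0, 19, 0):
        return "torchvision"  # handles both HTTP and HTTPS
    if scheme == "http":
        return "decord"  # decord handles HTTP but not HTTPS
    return None


url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
backend = pick_video_reader(url, torchvision_version=(0, 19, 0))  # example version
if backend is not None:
    # Set the override before the video is loaded
    os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
```

When the function returns `None`, downloading the video first and passing a local `file://` path avoids the URL limitation.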

Batch inference

# Example messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
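
The trimming step used in the examples above exists because `generate` returns each prompt followed by the newly generated tokens. Padding makes every row of `input_ids` the same length, so slicing by `len(in_ids)` removes the prompt, including its padding, from each row. The sketch below reproduces that slice on plain Python lists; the token IDs and the `PAD` value are made up for illustration.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


PAD = 0  # made-up padding id, for illustration only

# Two prompts padded to the same length (4 tokens each)
input_ids = [
    [PAD, PAD, 11, 12],
    [21, 22, 23, 24],
]
# Each output row is the padded prompt followed by two new tokens
generated_ids = [
    [PAD, PAD, 11, 12, 101, 102],
    [21, 22, 23, 24, 201, 202],
]

trimmed = trim_prompt(input_ids, generated_ids)
print(trimmed)
```

Only the trimmed rows are passed to `batch_decode`, so the decoded text contains the answer without the prompt.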

🤖 ModelScope

We strongly recommend ModelScope, especially for users in mainland China. snapshot_download can help resolve problems encountered when downloading checkpoints.
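
A minimal sketch of the download step follows. It assumes the `modelscope` package is installed and that the checkpoint is published on ModelScope under the same ID as on Hugging Face (`Qwen/Qwen2.5-VL-7B-Instruct`). `download_checkpoint` is a hypothetical wrapper around `snapshot_download`.

```python
def download_checkpoint(model_id="Qwen/Qwen2.5-VL-7B-Instruct", cache_dir=None):
    """Download a model snapshot from ModelScope and return the local directory."""
    # Imported lazily so this module loads even when modelscope is not installed
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)


if __name__ == "__main__":
    # Downloads several gigabytes; run only when the checkpoint is needed
    model_dir = download_checkpoint()
    print(model_dir)
```

The returned directory can then be passed to `from_pretrained` in place of the model ID.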

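✨ Key Features

Key enhancements

Visual understanding: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects; it is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agent capability: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, and it is capable of computer use and phone use.
Long video understanding and event capture: Qwen2.5-VL can comprehend videos longer than one hour, and it can now capture events by pinpointing the relevant video segments.
Multi-format visual grounding: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output generation: for scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, which benefits applications in finance, commerce, and similar fields.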
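
JSON output for visual grounding is most useful when it is validated before use. The sketch below parses a reply containing a JSON list of boxes and keeps only well-formed entries. The key names `bbox_2d` and `label`, the `[x1, y1, x2, y2]` pixel layout, and the sample reply are assumptions for illustration; the actual output depends on how the model is prompted. `parse_boxes` is a hypothetical helper.

```python
import json


def parse_boxes(model_output, image_width, image_height):
    """Extract and validate bounding boxes from a JSON list in the model's reply."""
    start, end = model_output.find("["), model_output.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(model_output[start : end + 1])
    except json.JSONDecodeError:
        return []
    boxes = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box = item.get("bbox_2d")  # assumed key name
        if not (isinstance(box, list) and len(box) == 4):
            continue
        if not all(isinstance(v, (int, float)) for v in box):
            continue
        x1, y1, x2, y2 = box
        # Keep only well-formed boxes that lie inside the image
        if 0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height:
            boxes.append({"label": item.get("label", ""), "box": (x1, y1, x2, y2)})
    return boxes


# Made-up reply: one valid box and one malformed box
reply = (
    'Here are the objects: [{"bbox_2d": [10, 20, 110, 220], "label": "cat"}, '
    '{"bbox_2d": [5, 5, 2, 2], "label": "bad"}]'
)
print(parse_boxes(reply, image_width=640, image_height=480))
```

The same validate-then-use pattern applies to structured extraction from invoices or forms, with the schema adjusted to the fields requested in the prompt.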

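Model architecture updates

Dynamic resolution and frame-rate training for video understanding: dynamic resolution is extended to the temporal dimension through dynamic FPS sampling, so the model can comprehend videos at various sampling rates. Accordingly, mRoPE is updated in the temporal dimension with IDs and absolute time alignment, which lets the model learn temporal sequence and speed and ultimately pinpoint specific moments.
Streamlined and efficient vision encoder: window attention is strategically introduced into the ViT, which speeds up both training and inference. The ViT architecture is further optimized with SwiGLU and RMSNorm to align it with the structure of the Qwen2.5 LLM.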
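
The idea of aligning temporal position IDs with absolute time can be illustrated with a toy calculation. This is a sketch of the concept only, not the model's implementation: `temporal_position_ids` is a hypothetical function, and the defaults for `temporal_patch_size` (frames per temporal grid) and `tokens_per_second` are illustrative assumptions. The point is that the same moment in a video maps to the same ID regardless of the sampling rate.

```python
def temporal_position_ids(num_temporal_grids, fps, temporal_patch_size=2, tokens_per_second=2):
    """Map each temporal grid index to a position id proportional to absolute time."""
    # Each temporal grid covers `temporal_patch_size` frames sampled at `fps`
    seconds_per_grid = temporal_patch_size / fps
    return [int(i * seconds_per_grid * tokens_per_second) for i in range(num_temporal_grids)]


ids_1fps = temporal_position_ids(3, fps=1.0)  # grids start at 0 s, 2 s, 4 s
ids_2fps = temporal_position_ids(5, fps=2.0)  # grids start at 0 s, 1 s, 2 s, 3 s, 4 s

print(ids_1fps)
print(ids_2fps)
# The grid that starts at 4 s receives the same id at both sampling rates
print(ids_1fps[2] == ids_2fps[4])
```

With plain frame indices instead, the 4-second mark would receive index 2 at 1 FPS and index 4 at 2 FPS, so the model could not relate positions to elapsed time.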

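Three models are available, with 3, 7, and 72 billion parameters. This repository contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit the blog and GitHub.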

📚 Documentation

Evaluation

Image benchmarks

Benchmark	InternVL2.5-8B	MiniCPM-o 2.6	GPT-4o-mini	Qwen2-VL-7B	Qwen2.5-VL-7B
MMMU_val	56	50.4	60	54.1	58.6
MMMU-Pro_val	34.3	-	37.6	30.5	41.0
DocVQA_test	93	93	-	94.5	95.7
InfoVQA_test	77.6	-	-	76.5	82.6
ChartQA_test	84.8	-	-	83.0	87.3
TextVQA_val	79.1	80.1	-	84.3	84.9
OCRBench	822	852	785	845	864
CC_OCR	57.7	-	-	61.6	77.8
MMStar	62.8	-	-	60.7	63.9
MMBench-V1.1-En_test	79.4	78.0	76.0	80.7	82.6
MMT-Bench_test	-	-	-	63.7	63.6
MMStar	61.5	57.5	54.8	60.7	63.9
MMVet_GPT-4-Turbo	54.2	60.0	66.9	62.0	67.1
HallBench_avg	45.2	48.1	46.1	50.6	52.9
MathVista_testmini	58.3	60.6	52.4	58.2	68.2
MathVision	-	-	-	16.3	25.07

Video benchmarks

Benchmark	Qwen2-VL-7B	Qwen2.5-VL-7B
MVBench	67.0	69.6
PerceptionTest_test	66.9	70.5
Video-MME_{wo/w subs}	63.3/69.0	65.1/71.6
LVBench	-	45.3
LongVideoBench	-	54.7
MMBench-Video	1.44	1.79
TempCompass	-	71.7
MLVU	-	70.2
CharadesSTA/mIoU	-	43.6

Agent benchmarks

Benchmark	Qwen2.5-VL-7B
ScreenSpot	84.7
ScreenSpot Pro	29.0
AITZ_EM	81.9
Android Control High_EM	60.1
Android Control Low_EM	93.7
AndroidWorld_SR	25.5
MobileMiniWob++_SR	91.4

🔧 Technical Details

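Processing long texts

The current config.json sets a maximum context length of 32,768 tokens. To handle inputs longer than 32,768 tokens, YaRN, a technique for improving length extrapolation, is used to maintain performance on long texts. However, this method significantly degrades performance on temporal and spatial localization tasks, so it is not recommended. For long video inputs, because MRoPE is itself more economical with position ids, max_position_embeddings can instead be raised directly to a larger value, such as 64k.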
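
Raising max_position_embeddings for long video inputs amounts to editing one field in a local copy of config.json. The sketch below does that with the standard library. It assumes the key sits at the top level of the file and reads "64k" as 65536; `raise_max_position_embeddings` is a hypothetical helper, and the exact config layout may differ between library versions.

```python
import json
from pathlib import Path


def raise_max_position_embeddings(config_path, new_value=65536):
    """Set max_position_embeddings in a config.json file and return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the key is stored at the top level of config.json
    old_value = config.get("max_position_embeddings")
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_value
```

Calling `raise_max_position_embeddings("/path/to/model/config.json")` rewrites the file in place, so keeping a backup of the original config is advisable.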

📄 License

This project is licensed under the Apache 2.0 license.

📚 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}