Qwen2.5-VL-7B-Instruct-GGUF: an open-source multimodal model for image understanding and text generation

Qwen2.5 VL 7B Instruct GGUF

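Developed by Mungert

Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image understanding and text generation tasks.

Image-to-text | English | License: Apache-2.0 | #multimodal visual understanding #ultra-low-bit quantization #edge device deployment

Downloads: 17.10k

Released: 3/27/2025

Model overview

This model is a multimodal model based on the Qwen2.5 architecture. It accepts image and text input and generates text output, which makes it suitable for tasks such as image captioning and visual question answering.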

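Model features

Multimodal support

Processes image and text input together and generates the corresponding text output.

Ultra-low-bit quantization

Uses the IQ-DynamicGate technique to support 1-2 bit quantization, substantially reducing model size while aiming to preserve accuracy.

Dynamic precision allocation

Applies a layer-wise strategy that quantizes different layers at different precisions to optimize model performance.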

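Model capabilities

Image captioning

Visual question answering

Multimodal reasoning

Use cases

Image understanding

Image captioning

Given an image, the model generates a detailed description of it.

Expected result: an accurate, detailed image description.

Visual question answering

Image-based Q&A

Given an image and a related question, the model generates an answer.

Expected result: an accurate answer grounded in the image content.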

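🚀 Qwen2.5-VL-7B-Instruct GGUF Models

The Qwen2.5-VL-7B-Instruct GGUF models are a family of multimodal models designed for image-text-to-text processing. Built on the transformers library, they take images and text as input and perform well on vision-language tasks.

🚀 Quick Start

Running the Qwen 2.5 VL Instruct model with llama.cpp

1. Download the Qwen 2.5 VL gguf file: go to https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/tree/main and choose a gguf file whose name does not contain mmproj. Example gguf file: https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-q8_0.gguf Copy this file to a folder of your choice.
2. Download the Qwen 2.5 VL mmproj file: from the same link, choose a file whose name contains mmproj. Example mmproj file: https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf Copy this file to the same folder.
3. Copy an image file: place the image in the same folder as the gguf files, or adjust the paths accordingly. Example image: https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/car-1.jpg Copy this file to the same folder.
4. Run the CLI tool: from that folder, run the following command: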

llama-mtmd-cli -m Qwen2.5-VL-7B-Instruct-q8_0.gguf --mmproj Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf  -p "Describe this image." --image ./car-1.jpg
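
The same call can be scripted. The following is a minimal sketch, assuming `llama-mtmd-cli` is on your PATH and the three example files sit in one folder; the helper names `build_mtmd_command` and `describe_image` are hypothetical and only wrap the command shown above.

```python
import shutil
import subprocess
from pathlib import Path


def build_mtmd_command(model, mmproj, image, prompt="Describe this image."):
    """Assemble the llama-mtmd-cli argument list from the command above."""
    return [
        "llama-mtmd-cli",
        "-m", str(model),
        "--mmproj", str(mmproj),
        "-p", prompt,
        "--image", str(image),
    ]


def describe_image(folder="."):
    """Run the CLI on the example files, assuming all three are in `folder`."""
    folder = Path(folder)
    cmd = build_mtmd_command(
        folder / "Qwen2.5-VL-7B-Instruct-q8_0.gguf",
        folder / "Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf",
        folder / "car-1.jpg",
    )
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("llama-mtmd-cli was not found on PATH")
    # check=True raises CalledProcessError if the CLI exits with an error.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `describe_image("/path/to/folder")` returns whatever the CLI prints to standard output.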

Chatting with 🤗 Transformers

import torch  # needed only if you enable the bfloat16 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings,
# especially in multi-image and video scenarios
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# By default, each image uses between 4 and 16384 visual tokens
# Set min_pixels and max_pixels to suit your needs, e.g. a token range of 256-1280, to balance performance and cost
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

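✨ Key Features

Improved visual understanding

Qwen2.5-VL not only recognizes common objects such as flowers, birds, fish, and insects, but also analyzes text, charts, icons, graphics, and layouts within images in depth.

Agent capabilities

Qwen2.5-VL can act directly as a visual agent that reasons and dynamically invokes tools, and it is capable of computer use and phone use.

Long-video understanding and event capture

Qwen2.5-VL can understand videos more than one hour long and has a new ability to capture events by pinpointing the relevant video segments.

Multi-format visual localization

Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.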
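
Consuming that JSON output might look like the sketch below. The key names `bbox_2d` and `label` and the `[x1, y1, x2, y2]` layout are assumptions made for illustration, so check them against the replies your model actually produces. `parse_grounding_output` is a hypothetical helper, not part of any library.

```python
import json
import re


def parse_grounding_output(text):
    """Extract a list of {label, box} dicts from a model reply.

    Assumes the reply contains a JSON array whose items carry a
    "bbox_2d" list [x1, y1, x2, y2] and a "label" string; both key
    names are assumptions for this sketch.
    """
    # Grab everything from the first "[" to the last "]" in the reply.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    results = []
    for item in items:
        box = item.get("bbox_2d") if isinstance(item, dict) else None
        if isinstance(box, list) and len(box) == 4:
            results.append({"label": item.get("label", ""), "box": tuple(box)})
    return results


reply = 'Detected objects: [{"bbox_2d": [10, 20, 110, 220], "label": "car"}]'
print(parse_grounding_output(reply))
```

Returning an empty list on malformed replies keeps downstream code simple, at the cost of hiding parse failures; log them if you need to debug prompts.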

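Structured output generation

For scanned data such as invoices, forms, and tables, Qwen2.5-VL can generate structured output of the content, which is useful in finance, commerce, and similar domains.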

📦 Installation

Install the dependencies

pip install git+https://github.com/huggingface/transformers accelerate

Install the toolkit

# The `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

💻 Usage Examples

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen 2.5 VL, frame-rate information is also fed to the model so it can align with absolute time
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
# video_kwargs is expected to carry the fps value already, so no separate fps argument is passed
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Batch inference

# Example messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

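📚 Documentation

Choosing the right model format

The right model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16)

A 16-bit floating-point format designed for faster computation while retaining good precision.
Offers a dynamic range similar to FP32 with lower memory use.
Recommended if your hardware supports BF16 acceleration (check your device specifications).
Suited to high-performance inference with a smaller memory footprint than FP32.

F16 (Float 16)

A 16-bit floating-point format with higher precision than BF16 but a narrower range of representable values.
Works on most devices that support FP16 acceleration (including many GPUs and some CPUs).
The narrower dynamic range makes overflow more likely than with BF16, but it is usually sufficient for inference.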
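
The range and precision trade-off between the two 16-bit formats can be checked with the standard library alone. This is an illustrative sketch: `to_bf16` emulates BF16 by truncating a float32 encoding, which is a simplification (real hardware typically rounds to nearest), and both helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round-trip x through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate BF16 by keeping the top 16 bits of the float32 encoding.

    This truncates instead of rounding to nearest, which is enough to
    illustrate range and precision.
    """
    top = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top + b"\x00\x00")[0]


# Precision: FP16 has 10 mantissa bits, BF16 has 7.
print(to_fp16(1.001))  # 1.0009765625
print(to_bf16(1.001))  # 1.0

# Range: 70000 exceeds the FP16 maximum (65504) but fits in BF16.
try:
    to_fp16(70000.0)
except OverflowError as exc:
    print("fp16 overflow:", exc)
print(to_bf16(70000.0))  # 69632.0
```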

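Quantized models (Q4_K, Q6_K, Q8, etc.)

Quantization reduces model size and memory use while preserving as much accuracy as possible.

Lower-bit models (Q4_K): best for minimizing memory use, at the possible cost of accuracy.
Higher-bit models (Q6_K, Q8_0): more accurate, but require more memory.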
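
To get a feel for the memory side of this trade-off, weight size can be estimated as parameters times bits per weight. The sketch below is a back-of-the-envelope calculation: the bits-per-weight figures are approximate values that include per-block scale overhead, and the 7e9 parameter count is a nominal assumption, so actual file sizes will differ.

```python
# Approximate bits per weight, including per-block scale overhead.
# These figures are assumptions for illustration, not measured file sizes.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "f16": 16.0,
    "q8_0": 8.5,
    "q6_k": 6.5625,
    "q4_k": 4.5,
}


def estimate_gib(num_params, fmt):
    """Estimate weight storage in GiB for a given format."""
    total_bytes = num_params * BITS_PER_WEIGHT[fmt] / 8
    return total_bytes / 2**30


NOMINAL_PARAMS = 7e9  # assumed parameter count for a "7B" model

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>5}: {estimate_gib(NOMINAL_PARAMS, fmt):.1f} GiB")
```

The estimate covers weights only; the KV cache, the mmproj file, and runtime buffers add to the total.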

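Very-low-bit quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for extreme memory efficiency, which makes them a good fit for low-power devices or large-scale deployments where memory is the key constraint.

Model file details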

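`Qwen2.5-VL-7B-Instruct-bf16.gguf`

Model weights stored in BF16.
Use this file if you want to requantize the model into a different format.
The best choice if your device supports BF16 acceleration.

`Qwen2.5-VL-7B-Instruct-f16.gguf`

Model weights stored in F16.
Use this file if your device supports FP16, especially when BF16 is unavailable.

`Qwen2.5-VL-7B-Instruct-bf16-q8_0.gguf`

Output and embedding layers are kept in BF16.
All other layers are quantized to Q8_0.
Use this file if your device supports BF16 and you want a quantized version.

`Qwen2.5-VL-7B-Instruct-f16-q8_0.gguf`

Output and embedding layers are kept in F16.
All other layers are quantized to Q8_0.

`Qwen2.5-VL-7B-Instruct-q4_k.gguf`

Output and embedding layers are quantized to Q8_0.
All other layers are quantized to Q4_K.
Suited to CPU inference with limited memory.

`Qwen2.5-VL-7B-Instruct-q4_k_s.gguf`

The smallest Q4_K variant; it trades accuracy for lower memory use.
Best for very-low-memory setups.

`Qwen2.5-VL-7B-Instruct-q6_k.gguf`

Output and embedding layers are quantized to Q8_0.
All other layers are quantized to Q6_K.

`Qwen2.5-VL-7B-Instruct-q8_0.gguf`

Fully Q8-quantized model.
Needs more memory than the lower-bit files in exchange for higher accuracy.

`Qwen2.5-VL-7B-Instruct-iq3_xs.gguf`

IQ3_XS quantization, optimized for extreme memory efficiency.
Best for ultra-low-memory devices.

`Qwen2.5-VL-7B-Instruct-iq3_m.gguf`

IQ3_M quantization, which offers a medium block size for better accuracy.
Suited to low-memory devices.

`Qwen2.5-VL-7B-Instruct-q4_0.gguf`

Pure Q4_0 quantization, optimized for ARM devices.
Best for ARM-based devices or low-memory environments.
For better accuracy, prefer IQ4_NL.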

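Processing long texts

The current config.json sets the maximum context length to 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that improves the model's length extrapolation so that it keeps performing well on long texts.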
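
As a rough illustration of what enabling YaRN involves, the sketch below builds a `rope_scaling` entry and computes the context length it would target. The field names and the factor of 4 are assumptions based on common YaRN configurations, not values taken from this repository, and a real config may need additional model-specific fields.

```python
import json


def yarn_rope_scaling(factor=4.0, original_max=32768):
    """Build a hypothetical rope_scaling entry for YaRN.

    The key names here are assumptions for illustration; confirm them
    against the documentation of the runtime you actually use.
    """
    return {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }


def extended_context(cfg):
    """Context length targeted after scaling: factor times the original."""
    return int(cfg["factor"] * cfg["original_max_position_embeddings"])


cfg = yarn_rope_scaling()
print(json.dumps({"rope_scaling": cfg}, indent=2))
print(extended_context(cfg))  # 131072
```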

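Image resolution adjustment

The model accepts a wide range of input resolutions. By default it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set minimum and maximum pixel counts to suit your needs, for example a token range of 256-1280, to balance speed and memory use.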
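
The relation between pixel limits and token counts can be sketched as follows, assuming each visual token covers a 28x28 pixel area (matching the `256*28*28` and `1280*28*28` figures in the quick-start code). `estimate_visual_tokens` is a hypothetical helper that approximates the resizing logic; it is not the library's own implementation.

```python
import math

PATCH = 28  # assumed side length, in pixels, of the area behind one visual token


def estimate_visual_tokens(height, width, min_tokens=256, max_tokens=1280):
    """Approximate the resized shape and visual-token count for one image."""
    min_pixels = min_tokens * PATCH * PATCH
    max_pixels = max_tokens * PATCH * PATCH
    # Snap both sides to the nearest multiple of the patch size.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink, keeping the aspect ratio, and round down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge, keeping the aspect ratio, and round up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h, w), (h // PATCH) * (w // PATCH)


print(estimate_visual_tokens(1000, 1000))  # ((980, 980), 1225)
```

A 1000x1000 image slightly exceeds the 1280-token budget after snapping to multiples of 28, so it is scaled down to 980x980, or 35x35 = 1225 tokens.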

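🔧 Technical Details

Ultra-low-bit quantization with IQ-DynamicGate (1-2 bit)

Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with improvements demonstrated by benchmarks on Llama-3-8B. It uses layer-specific strategies to preserve accuracy while remaining highly memory-efficient.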
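
The general idea of a layer-specific strategy can be illustrated with a toy bit-allocation plan. This is not the IQ-DynamicGate algorithm: the rule (more bits for the first and last quarter of layers), the bit widths, and the 28-layer stack are all hypothetical choices made for the example.

```python
def assign_layer_bits(num_layers, edge_fraction=0.25, edge_bits=4, middle_bits=2):
    """Give the first and last `edge_fraction` of layers more bits.

    A hypothetical rule used only to illustrate per-layer precision.
    """
    edge = int(num_layers * edge_fraction)
    plan = []
    for i in range(num_layers):
        is_edge = i < edge or i >= num_layers - edge
        plan.append(edge_bits if is_edge else middle_bits)
    return plan


def average_bits(plan):
    """Mean bits per layer, assuming all layers have equal size."""
    return sum(plan) / len(plan)


plan = assign_layer_bits(28)
print(plan)
print(average_bits(plan))  # 3.0
```

With these settings, 14 of the 28 layers get 4 bits and 14 get 2 bits, for an average of 3 bits per layer.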

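Model architecture updates

Dynamic resolution and frame-rate training for video understanding

We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, so the model can understand videos at different sampling rates. Accordingly, we update mRoPE in the temporal dimension with IDs and absolute time alignment, so the model can learn temporal order and speed and, ultimately, pinpoint specific moments.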
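
The effect of aligning temporal IDs with absolute time can be shown with a simplified sketch: frames sampled at different rates receive IDs derived from their timestamps, not from their frame index. The function names and the `ids_per_second` value are hypothetical, and the model's actual mRoPE computation is more involved.

```python
def sample_timestamps(duration_s, fps):
    """Timestamps in seconds of frames sampled uniformly at `fps`."""
    n = max(1, int(duration_s * fps))
    return [i / fps for i in range(n)]


def temporal_position_ids(timestamps, ids_per_second=2.0):
    """Map absolute timestamps to integer temporal IDs.

    `ids_per_second` is an assumed constant chosen for illustration.
    """
    return [int(t * ids_per_second) for t in timestamps]


slow = temporal_position_ids(sample_timestamps(4, fps=1))
fast = temporal_position_ids(sample_timestamps(4, fps=2))
print(slow)  # [0, 2, 4, 6]
print(fast)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The frame at second 2 gets ID 4 under both sampling rates, so the position encodes when something happens instead of how many frames precede it.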

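Streamlined and efficient vision encoder

We speed up training and inference by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.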
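
For readers unfamiliar with the two building blocks named above, here is a minimal pure-Python sketch of RMSNorm and of the element-wise gating at the heart of SwiGLU. It operates on plain lists and omits the learned projection matrices, so it illustrates the formulas only and does not reproduce the model's layers.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up.

    In a full MLP, `gate` and `up` would come from two linear projections
    of the same input, followed by a third projection on the result.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


print(rms_norm([3.0, 4.0], [1.0, 1.0]))
print(swiglu([0.0, 2.0], [5.0, 1.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which removes one reduction per call.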

📄 License

This project is licensed under Apache-2.0.

📈 Evaluation

Image benchmarks

Benchmark	InternVL2.5-8B	MiniCPM-o 2.6	GPT-4o-mini	Qwen2-VL-7B	Qwen2.5-VL-7B
MMMU_val	56	50.4	60	54.1	58.6
MMMU-Pro_val	34.3	-	37.6	30.5	41.0
DocVQA_test	93	93	-	94.5	95.7
InfoVQA_test	77.6	-	-	76.5	82.6
ChartQA_test	84.8	-	-	83.0	87.3
TextVQA_val	79.1	80.1	-	84.3	84.9
OCRBench	822	852	785	845	864
CC_OCR	57.7	-	-	61.6	77.8
MMStar	62.8	-	-	60.7	63.9
MMBench-V1.1-En_test	79.4	78.0	76.0	80.7	82.6
MMT-Bench_test	-	-	-	63.7	63.6
MMStar	61.5	57.5	54.8	60.7	63.9
MMVet_GPT-4-Turbo	54.2	60.0	66.9	62.0	67.1
HallBench_avg	45.2	48.1	46.1	50.6	52.9
MathVista_testmini	58.3	60.6	52.4	58.2	68.2
MathVision	-	-	-	16.3	25.07

Video benchmarks

Benchmark	Qwen2-VL-7B	Qwen2.5-VL-7B
MVBench	67.0	69.6
PerceptionTest_test	66.9	70.5
Video-MME_{wo/w subs}	63.3/69.0	65.1/71.6
LVBench	-	45.3
LongVideoBench	-	54.7
MMBench-Video	1.44	1.79
TempCompass	-	71.7
MLVU	-	70.2
CharadesSTA/mIoU	43.6	-

Agent benchmarks

Benchmark	Qwen2.5-VL-7B
ScreenSpot	84.7
ScreenSpot Pro	29.0
AITZ_EM	81.9
Android Control High_EM	60.1
Android Control Low_EM	93.7
AndroidWorld_SR	25.5
MobileMiniWob++_SR	91.4

📖 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}