MiniCPM-V-2_6 Open-Source Multimodal LLM - Free to Deploy, with Single-Image, Multi-Image, and Video Understanding

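Developed by jchevallard

MiniCPM-V 2.6 is the latest and most capable multimodal LLM in the MiniCPM-V series. It supports single-image, multi-image, and video understanding, and combines leading performance with high efficiency.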

Image-to-Text

Transformers

Other #multimodal-understanding #on-device-deployment #multi-image-reasoning

Downloads: 118

Released: 8/30/2024

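Model overview

MiniCPM-V 2.6 is a multimodal LLM built on SigLip-400M and Qwen2-7B, with 8B parameters in total. It supports single-image, multi-image, and video understanding, has strong OCR and multilingual capabilities, and suits a wide range of vision and language tasks.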

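Model features

Leading performance

MiniCPM-V 2.6 achieves an average score of 65.2 on the OpenCompass comprehensive evaluation, surpassing proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.

Multi-image understanding and in-context learning

Supports conversation and reasoning across multiple images, reaches state-of-the-art results on multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows strong in-context learning ability.

Video understanding

Accepts video input for conversation and dense captioning of spatio-temporal information. Outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on the Video-MME benchmark.

Strong OCR and other capabilities

Handles images of any aspect ratio up to 1.8 million pixels (for example 1344x1344) and reaches state-of-the-art results on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro.

High efficiency

Has state-of-the-art token density: a 1.8-million-pixel image produces only 640 tokens, 75% fewer than mainstream models, which directly improves inference speed, first-token latency, memory usage, and power consumption.

Ready to use

Offers several ways to use the model: local CPU inference, quantized models, vLLM inference, fine-tuning on new domains and tasks, a quick local WebUI setup, and an online demo.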

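Model capabilities

Single-image understanding

Multi-image understanding

Video understanding

OCR

Multilingual support

In-context learning

Cross-image conversation and reasoning

Spatio-temporal conversation

Dense captioning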

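Use cases

Image understanding

OCR

Recognizes text in images

Reaches state-of-the-art results on OCRBench

Multi-image comparison

Compares similarities and differences across multiple images

Reaches state-of-the-art results on multi-image benchmarks such as Mantis-Eval and BLINK

Video understanding

Video content analysis

Analyzes spatio-temporal information in video

Outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on the Video-MME benchmark

Multilingual applications

Multilingual menu translation

Translates multilingual menus in images

Supports English, Chinese, German, French, Italian, Korean, and other languages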

🚀 MiniCPM-V 2.6

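MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. It is built on SigLip-400M and Qwen2-7B, with 8B parameters in total. Compared with MiniCPM-Llama3-V 2.5, it delivers a significant performance improvement and introduces new features for multi-image and video understanding.

🚀 Quick start

You can try the MiniCPM-V 2.6 demo online.

Requirements

Inference uses Hugging Face transformers on NVIDIA GPUs. The following dependencies were tested with Python 3.10: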

Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
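
Before loading the model, it can help to confirm that the local environment matches the pinned versions. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_environment` are hypothetical and not part of the MiniCPM-V project, and treating a local build tag such as `+cu121` as a match is an assumption of this sketch.

```python
from importlib import metadata

# Pinned versions copied from the requirements list above.
PINNED = {
    "Pillow": "10.1.0",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.40.0",
    "sentencepiece": "0.1.99",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pinned=PINNED, lookup=installed_version):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Assumption: a local build tag such as "2.1.2+cu121" counts as a match.
        matches = found is not None and found.split("+")[0] == expected
        report[name] = (expected, found, matches)
    return report


if __name__ == "__main__":
    for name, (expected, found, ok) in check_environment().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:15s} expected {expected:8s} found {found}  [{status}]")
```

A mismatch does not necessarily mean inference will fail; the list only records the versions that were tested.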

Code example

# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

See the GitHub repository for more usage details.

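✨ Key features

🔥 Leading performance

On the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks, MiniCPM-V 2.6 achieves an average score of 65.2. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.

🖼️ Multi-image understanding and in-context learning

MiniCPM-V 2.6 can also hold conversations and reason over multiple images. It achieves state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows promising in-context learning ability.

🎬 Video understanding

MiniCPM-V 2.6 accepts video input, carrying out conversation and providing dense captions for spatio-temporal information. It outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on Video-MME, both with and without subtitles.

💪 Strong OCR capability and more

MiniCPM-V 2.6 can process images of any aspect ratio with up to 1.8 million pixels (for example 1344x1344). It achieves state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Built on the latest RLAIF-V and VisCPM techniques, it shows trustworthy behavior, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and it supports multilingual use in English, Chinese, German, French, Italian, Korean, and other languages.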

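🚀 Superior efficiency

Beyond its friendly size, MiniCPM-V 2.6 also shows state-of-the-art token density (the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8-million-pixel image, 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support real-time video understanding on end-side devices such as iPad.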
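
The token density figure can be checked with a few lines of arithmetic. This is a minimal sketch based only on the numbers quoted above (a 1344x1344 image and 640 visual tokens); the function name `token_density` is hypothetical, and the baseline token count is merely what "75% fewer" implies, not a measured value for any particular model.

```python
def token_density(width, height, visual_tokens):
    """Pixels encoded per visual token at the given resolution."""
    return (width * height) / visual_tokens


pixels = 1344 * 1344                      # 1,806,336 pixels, roughly 1.8 million
density = token_density(1344, 1344, 640)  # 2822.4 pixels per visual token

# "75% fewer tokens" implies a comparison model that uses four times as many.
baseline_tokens = 640 / (1 - 0.75)        # 2560.0 tokens

print(pixels, density, baseline_tokens)
```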

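💫 Easy to use

MiniCPM-V 2.6 can be used in several ways:

llama.cpp and ollama support for efficient CPU inference on local devices.
int4 and GGUF format quantized models in 16 sizes.
vLLM support for high-throughput and memory-efficient inference.
Fine-tuning on new domains and tasks.
Quick local WebUI demo setup with Gradio.
Online web demo.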

📚 Detailed documentation

Evaluation

Single-image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, and Object HalBench:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/QVl0iPtT5aUhlvViyEpgs.png)

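* We evaluate this benchmark using chain-of-thought prompting.

⁺ Token density: the number of pixels encoded into each visual token at maximum resolution, that is, the number of pixels at maximum resolution divided by the number of visual tokens.

Note: for proprietary models, we calculate token density from the image encoding charging strategy defined in the official API documentation, which gives an upper-bound estimate.

Multi-image results on Mantis Eval, BLINK Val, Mathverse mv, Sciverse mv, and MIRB: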

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/o6FGHytRhzeatmhxq0Dbi.png)

* We evaluate the officially released checkpoints ourselves.

Video results on Video-MME and Video-ChatGPT:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/jmrjoRr8SFLkrstjDmpaV.png)

Few-shot results on TextVQA, VizWiz, VQAv2, and OK-VQA:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/zXIuiCTTe-POqKGHszdn0.png)

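* denotes zero image shots and two additional text shots, following Flamingo.

⁺ We evaluate the pretrained checkpoint without SFT.

Examples

We deployed MiniCPM-V 2.6 on end-side devices. The demo video is a raw, unedited screen recording on an iPad Pro.

Multi-image conversation

Python code for running MiniCPM-V 2.6 with multi-image input: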

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

In-context few-shot learning

Python code for running MiniCPM-V 2.6 with few-shot input:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
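
The few-shot message list above follows a regular pattern: each solved example is a user turn followed by an assistant turn, and the final user turn is left unanswered. The following is a minimal sketch of a helper that builds that list for any number of examples. The name `build_few_shot_msgs` is hypothetical and not part of the model's API, and the strings in the usage lines stand in for `PIL.Image` objects.

```python
def build_few_shot_msgs(examples, test_image, question):
    """Build the msgs list for in-context learning.

    examples: list of (image, answer) pairs shown to the model as solved cases.
    test_image: the image the model should answer for.
    question: the prompt repeated on every user turn.
    """
    msgs = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    # The final user turn has no assistant reply; the model completes it.
    msgs.append({'role': 'user', 'content': [test_image, question]})
    return msgs


# Placeholder strings stand in for PIL images loaded with Image.open(...).
msgs = build_few_shot_msgs(
    examples=[("<example1.jpg>", "2023.08.04"), ("<example2.jpg>", "2007.04.24")],
    test_image="<test.jpg>",
    question="production date",
)
print(len(msgs))
```

The resulting list has the same shape as the hand-written `msgs` above and can be passed to `model.chat` in the same way.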

Video conversation

Python code for running MiniCPM-V 2.6 with video input:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path ="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]

# Set decode params for video
params={}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution >  448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
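
To see which frames `encode_video` selects without decoding a real file, the index logic can be run on its own. This is a minimal sketch that mirrors the sampling above: roughly one frame per second, then uniform subsampling once the count exceeds the cap. The function `select_frame_indices` is a hypothetical wrapper, and the `max(1, ...)` guard against a zero step is an addition of this sketch that the original code does not have.

```python
def uniform_sample(items, n):
    """Pick n items spread evenly across the list, taking the middle of each bin."""
    gap = len(items) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [items[i] for i in idxs]


def select_frame_indices(total_frames, avg_fps, max_frames=64):
    """Return the frame indices that would be decoded for a video."""
    # About one frame per second; the guard avoids a zero step for very low FPS.
    step = max(1, round(avg_fps))
    frame_idx = list(range(0, total_frames, step))
    if len(frame_idx) > max_frames:
        frame_idx = uniform_sample(frame_idx, max_frames)
    return frame_idx


short_clip = select_frame_indices(total_frames=300, avg_fps=30.0)   # 10 s at 30 FPS
long_clip = select_frame_indices(total_frames=9000, avg_fps=30.0)   # 5 min at 30 FPS
print(len(short_clip), len(long_clip), long_clip[0], long_clip[-1])
```

The short clip keeps all 10 one-per-second frames, while the long clip is reduced from 300 candidates to 64, which is why lowering `MAX_NUM_FRAMES` directly lowers memory use.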

🔧 Technical details

Inference with llama.cpp

MiniCPM-V 2.6 can run with llama.cpp. See our llama.cpp fork for details.

Int4 quantized version

Download the Int4 quantized version, MiniCPM-V-2_6-int4, for lower GPU memory usage (7 GB).
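
A rough estimate shows why quantization lowers the memory footprint. This is a back-of-the-envelope sketch that assumes a nominal 8 billion parameters and counts weight storage only; the function name `weight_memory_gb` is hypothetical. The 7 GB figure quoted above is the project's own number and presumably covers more than raw weights, such as activations and any components kept at higher precision, so the two values are not expected to match.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # nominal total parameter count stated for MiniCPM-V 2.6

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16.0 GB at 16 bits per parameter
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4.0 GB at 4 bits per parameter

print(bf16_gb, int4_gb)
```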

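📄 License

Model license

The code in this repository is released under the Apache-2.0 license.
Use of the MiniCPM-V series model weights must strictly follow the MiniCPM Model License.
The MiniCPM models and weights are completely free for academic research. After registering by filling out the questionnaire, MiniCPM-V 2.6 weights are also free for commercial use.

Statement

As a large multimodal model, MiniCPM-V 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Nothing generated by MiniCPM-V 2.6 represents the views or positions of the model developers.
We are not liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, risks to public opinion, or any risks and problems arising from the misdirection, misuse, dissemination, or abuse of the model.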

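Key techniques and other multimodal projects

👏 You are welcome to explore the key techniques behind MiniCPM-V 2.6 and our team's other multimodal projects: VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V

Citation

If you find our work helpful, please consider citing our paper 📝 and liking this project ❤️!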

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}