Ovis2-1B-dev: an open-source multimodal large language model with strong video, multi-image, and reasoning capabilities

Ovis2 1B Dev

Developed by Isotr0py

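Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs), which focus on structurally aligning visual and textual embeddings. It offers strong performance for its size, enhanced reasoning, video and multi-image processing, and improved multilingual OCR.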

Image-text-to-text

Transformers

Multilingual | License: Apache-2.0 | #Multimodal LLM #Vision-text alignment #Enhanced multilingual OCR

Downloads: 79

Released: 4/9/2025

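Model overview

Ovis2-1B is a multimodal large language model released by AIDC-AI, designed to structurally align visual and textual embeddings. As the successor to Ovis1.6, Ovis2 brings significant improvements in data curation and training methods, and is particularly suited to complex visual inputs and multilingual OCR tasks.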

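Model features

Small model, strong performance

Optimized training strategies give small models higher capability density, so they compete with models from larger size tiers.

Enhanced reasoning

Instruction tuning combined with preference learning significantly improves chain-of-thought (CoT) reasoning.

Video and multi-image processing

Video and multi-image data are included in training, improving the handling of complex visual information across frames and images.

Enhanced multilingual OCR

Multilingual OCR is improved beyond the English and Chinese baseline, with better extraction of structured data from complex visual elements such as tables and charts.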

Model capabilities

Image understanding

Text generation

Video understanding

Multi-image analysis

Multilingual OCR

Complex reasoning

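Use cases

Visual question answering

Image captioning

Describes the input image in detail

Scores 68.4 on MMBench-V1.1 (test)

Visual reasoning

Performs logical reasoning over image content

Scores 59.4 on MathVista (testmini)

Document understanding

Table data extraction

Extracts structured data from complex tables

Scores 89.0 on OCRBench

Video understanding

Video content analysis

Understands actions and scenes in videos

Scores 49.5 on VideoMME (with subtitles)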

🚀 Ovis2-1B

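Ovis2-1B is a multimodal large language model that inherits the architectural design of the Ovis series and significantly improves dataset curation and training methods. It offers strong performance for its size, enhanced reasoning, video and multi-image processing, and multilingual OCR.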

🚀 Quick start

To use Ovis2-1B, install the dependencies and then run the inference script that follows:

pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
pip install flash-attn==2.7.0.post2 --no-build-isolation

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Single-image input
image_path = '/data/images/example_1.jpg'
images = [Image.open(image_path)]
max_partition = 9
text = 'Describe the image.'
query = f'<image>\n{text}'

## Chain-of-thought style input
# cot_suffix = "Provide a step-by-step solution to the problem, and conclude with 'the answer is' followed by the final solution."
# image_path = '/data/images/example_1.jpg'
# images = [Image.open(image_path)]
# max_partition = 9
# text = "What's the area of the shape?"
# query = f'<image>\n{text}\n{cot_suffix}'

## Multi-image input
# image_paths = [
#     '/data/images/example_1.jpg',
#     '/data/images/example_2.jpg',
#     '/data/images/example_3.jpg'
# ]
# images = [Image.open(image_path) for image_path in image_paths]
# max_partition = 4
# text = 'Describe each image.'
# query = '\n'.join([f'Image {i+1}: <image>' for i in range(len(images))]) + '\n' + text

## Video input (requires `pip install moviepy==1.0.3`)
# from moviepy.editor import VideoFileClip
# video_path = '/data/videos/example_1.mp4'
# num_frames = 12
# max_partition = 1
# text = 'Describe the video.'
# with VideoFileClip(video_path) as clip:
#     total_frames = int(clip.fps * clip.duration)
#     if total_frames <= num_frames:
#         sampled_indices = range(total_frames)
#     else:
#         stride = total_frames / num_frames
#         sampled_indices = [min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2)) for i in range(num_frames)]
#     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
#     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
# images = frames
# query = '\n'.join(['<image>'] * len(images)) + '\n' + text

## Text-only input
# images = []
# max_partition = None
# text = 'Hello'
# query = text

# Format the conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

# Generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'Output:\n{output}')
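
The commented-out variants in the script differ only in how they build `query` and, for video, in which frames they sample. The sketch below isolates those two steps as plain Python functions so they can be checked without loading the model. The function names `sample_frame_indices` and `build_query` are hypothetical helpers introduced here for illustration and are not part of the Ovis API; the arithmetic and string layout mirror the script.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick up to `num_frames` indices, one from the middle of each equal-length segment."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    stride = total_frames / num_frames
    # Midpoint of segment i is (stride * i + stride * (i + 1)) / 2, clamped to the last frame.
    return [
        min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2))
        for i in range(num_frames)
    ]


def build_query(text: str, num_images: int, numbered: bool = False) -> str:
    """Prefix `text` with one `<image>` placeholder line per image or video frame."""
    if num_images == 0:
        # Text-only input: the query is just the text.
        return text
    if numbered:
        # Multi-image style: "Image 1: <image>", "Image 2: <image>", ...
        placeholders = [f"Image {i + 1}: <image>" for i in range(num_images)]
    else:
        # Single-image and video style: bare placeholders.
        placeholders = ["<image>"] * num_images
    return "\n".join(placeholders) + "\n" + text


if __name__ == "__main__":
    print(sample_frame_indices(total_frames=120, num_frames=12))
    print(build_query("Describe each image.", 3, numbered=True))
```

Sampling from segment midpoints rather than segment starts avoids always picking frame 0 and spreads the samples evenly over the clip.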

Batch inference

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Preprocess inputs
batch_inputs = [
    ('/data/images/example_1.jpg', 'What colors dominate the image?'),
    ('/data/images/example_2.jpg', 'What objects are depicted in this image?'),
    ('/data/images/example_3.jpg', 'Is there any text in the image?')
]

batch_input_ids = []
batch_attention_mask = []
batch_pixel_values = []

for image_path, text in batch_inputs:
    image = Image.open(image_path)
    query = f'<image>\n{text}'
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    batch_input_ids.append(input_ids.to(device=model.device))
    batch_attention_mask.append(attention_mask.to(device=model.device))
    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True,
                                                  padding_value=0.0).flip(dims=[1])
batch_input_ids = batch_input_ids[:, -model.config.multimodal_max_length:]
batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask],
                                                       batch_first=True, padding_value=False).flip(dims=[1])
batch_attention_mask = batch_attention_mask[:, -model.config.multimodal_max_length:]

# Generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(batch_input_ids, pixel_values=batch_pixel_values, attention_mask=batch_attention_mask,
                                **gen_kwargs)

for i in range(len(batch_inputs)):
    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
    print(f'Output {i + 1}:\n{output}\n')
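
The batch script left-pads its inputs: it flips each sequence, right-pads with `pad_sequence`, flips the result back, and then keeps only the last `multimodal_max_length` columns. Left padding matters for decoder-only generation because new tokens are appended on the right, so every prompt has to end in the same column. The pure-Python sketch below reproduces that result on plain lists; `left_pad` is a hypothetical helper, and it simplifies the mask by marking every real token as attended, whereas the script derives the mask from `pad_token_id`.

```python
from typing import List, Optional, Sequence, Tuple


def left_pad(
    sequences: Sequence[Sequence[int]],
    pad_value: int = 0,
    max_length: Optional[int] = None,
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Left-pad token-id lists to a common width and build matching attention masks."""
    if not sequences:
        return [], []
    width = max(len(seq) for seq in sequences)
    padded: List[List[int]] = []
    masks: List[List[bool]] = []
    for seq in sequences:
        gap = width - len(seq)
        row = [pad_value] * gap + list(seq)
        # Simplification: padding is masked out, every real token is attended.
        mask = [False] * gap + [True] * len(seq)
        if max_length is not None:
            # Keep the rightmost columns, like `[:, -multimodal_max_length:]` in the script.
            row, mask = row[-max_length:], mask[-max_length:]
        padded.append(row)
        masks.append(mask)
    return padded, masks


if __name__ == "__main__":
    ids, mask = left_pad([[1, 2, 3], [4, 5], [6]])
    for row, row_mask in zip(ids, mask):
        print(row, row_mask)
```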

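✨ Key features

Small model, strong performance: optimized training strategies give small models higher capability density, so they compete with models from larger size tiers.
Enhanced reasoning: combining instruction tuning with preference learning significantly improves chain-of-thought (CoT) reasoning.
Video and multi-image processing: video and multi-image data are included in training, improving the handling of complex visual information across frames and images.
Multilingual support and OCR: OCR is strengthened beyond English and Chinese, and structured data extraction from complex visual elements such as tables and charts is improved.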
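
The chain-of-thought variant in the quick start appends a suffix asking the model to conclude with 'the answer is' followed by the final solution. A minimal sketch of recovering that final answer from the generated text follows. `extract_final_answer` is a hypothetical post-processing helper, not part of Ovis, and it assumes the model actually followed the instruction.

```python
import re
from typing import Optional


def extract_final_answer(output: str) -> Optional[str]:
    """Return the text after the last 'the answer is', or None if the marker is absent."""
    matches = list(re.finditer(r"the answer is", output, flags=re.IGNORECASE))
    if not matches:
        return None
    # Use the last occurrence: earlier ones may belong to intermediate steps.
    answer = output[matches[-1].end():].strip()
    # Drop an optional colon after the marker and a trailing period.
    answer = answer.lstrip(":").strip()
    answer = answer.rstrip(".").strip()
    return answer or None


if __name__ == "__main__":
    sample = "The shape is a 3 by 4 rectangle, so 3 * 4 = 12. So the answer is 12."
    print(extract_final_answer(sample))
```

Returning `None` when the marker is missing lets the caller fall back to the full output instead of silently reporting an empty answer.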

📚 Detailed documentation

Model zoo

Ovis MLLM	Vision Transformer (ViT)	Large language model (LLM)	Model weights	Demo
Ovis2-1B	aimv2-large-patch14-448	Qwen2.5-0.5B-Instruct	Huggingface	Space
Ovis2-2B	aimv2-large-patch14-448	Qwen2.5-1.5B-Instruct	Huggingface	Space
Ovis2-4B	aimv2-huge-patch14-448	Qwen2.5-3B-Instruct	Huggingface	Space
Ovis2-8B	aimv2-huge-patch14-448	Qwen2.5-7B-Instruct	Huggingface	Space
Ovis2-16B	aimv2-huge-patch14-448	Qwen2.5-14B-Instruct	Huggingface	Space
Ovis2-34B	aimv2-1B-patch14-448	Qwen2.5-32B-Instruct	Huggingface	-
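
The table pairs each Ovis2 size with a vision encoder and a language model. A small lookup like the sketch below can make that pairing available to scripts. The names `OvisVariant`, `MODEL_ZOO`, `variants_with_vit`, and `repo_id` are hypothetical; only `AIDC-AI/Ovis2-1B` appears in the quick start, so treating the other sizes as following the same repository naming is an assumption.

```python
from typing import Dict, List, NamedTuple


class OvisVariant(NamedTuple):
    vit: str  # vision encoder listed in the table
    llm: str  # language model listed in the table


# Rows copied from the model zoo table.
MODEL_ZOO: Dict[str, OvisVariant] = {
    "Ovis2-1B": OvisVariant("aimv2-large-patch14-448", "Qwen2.5-0.5B-Instruct"),
    "Ovis2-2B": OvisVariant("aimv2-large-patch14-448", "Qwen2.5-1.5B-Instruct"),
    "Ovis2-4B": OvisVariant("aimv2-huge-patch14-448", "Qwen2.5-3B-Instruct"),
    "Ovis2-8B": OvisVariant("aimv2-huge-patch14-448", "Qwen2.5-7B-Instruct"),
    "Ovis2-16B": OvisVariant("aimv2-huge-patch14-448", "Qwen2.5-14B-Instruct"),
    "Ovis2-34B": OvisVariant("aimv2-1B-patch14-448", "Qwen2.5-32B-Instruct"),
}


def variants_with_vit(vit: str) -> List[str]:
    """List the model sizes that share a given vision encoder."""
    return [name for name, variant in MODEL_ZOO.items() if variant.vit == vit]


def repo_id(name: str, org: str = "AIDC-AI") -> str:
    """Build a repository id; assumes every size follows the `AIDC-AI/Ovis2-1B` pattern."""
    if name not in MODEL_ZOO:
        raise KeyError(f"unknown Ovis2 variant: {name}")
    return f"{org}/{name}"


if __name__ == "__main__":
    print(repo_id("Ovis2-1B"), MODEL_ZOO["Ovis2-1B"])
    print(variants_with_vit("aimv2-huge-patch14-448"))
```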

Performance evaluation

We evaluate Ovis2 with VLMEvalKit, the toolkit also used by the OpenCompass multimodal and reasoning leaderboards.

Image benchmarks

Benchmark	Qwen2.5-VL-3B	SAIL-VL-2B	InternVL2.5-2B-MPO	Ovis1.6-3B	InternVL2.5-1B-MPO	Ovis2-1B	Ovis2-2B
MMBench-V1.1 (test)	77.1	73.6	70.7	74.1	65.8	68.4	76.9
MMStar	56.5	56.5	54.9	52.0	49.5	52.1	56.7
MMMU (val)	51.4	44.1	44.6	46.7	40.3	36.1	45.6
MathVista (testmini)	60.1	62.8	53.4	58.9	47.7	59.4	64.1
HallusionBench	48.7	45.9	40.7	43.8	34.8	45.2	50.2
AI2D	81.4	77.4	75.1	77.8	68.5	76.4	82.7
OCRBench	83.1	83.1	83.8	80.1	84.3	89.0	87.3
MMVet	63.2	44.2	64.2	57.6	47.2	50.0	58.3
MMBench (test)	78.6	77	72.8	76.6	67.9	70.2	78.9
MMT-Bench (val)	60.8	57.1	54.4	59.2	50.8	55.5	61.7
RealWorldQA	66.5	62	61.3	66.7	57	63.9	66.0
BLINK	48.4	46.4	43.8	43.8	41	44.0	47.9
QBench	74.4	72.8	69.8	75.8	63.3	71.3	76.2
ABench	75.5	74.5	71.1	75.2	67.5	71.3	76.6
MTVQA	24.9	20.2	22.6	21.1	21.7	23.7	25.6
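
To make the small-model comparison concrete, the sketch below counts the image benchmarks on which Ovis2-1B scores above InternVL2.5-1B-MPO, the other model of similar size in the table. The scores are copied from the two corresponding columns, with benchmark names written in English; the `compare` helper is hypothetical.

```python
from typing import Dict, List, Tuple

# (InternVL2.5-1B-MPO, Ovis2-1B) scores copied from the image benchmark table.
SCORES: Dict[str, Tuple[float, float]] = {
    "MMBench-V1.1 (test)": (65.8, 68.4),
    "MMStar": (49.5, 52.1),
    "MMMU (val)": (40.3, 36.1),
    "MathVista (testmini)": (47.7, 59.4),
    "HallusionBench": (34.8, 45.2),
    "AI2D": (68.5, 76.4),
    "OCRBench": (84.3, 89.0),
    "MMVet": (47.2, 50.0),
    "MMBench (test)": (67.9, 70.2),
    "MMT-Bench (val)": (50.8, 55.5),
    "RealWorldQA": (57.0, 63.9),
    "BLINK": (41.0, 44.0),
    "QBench": (63.3, 71.3),
    "ABench": (67.5, 71.3),
    "MTVQA": (21.7, 23.7),
}


def compare(scores: Dict[str, Tuple[float, float]]) -> Tuple[List[str], List[str]]:
    """Split benchmarks into those where Ovis2-1B scores higher and those where it scores lower."""
    wins = [name for name, (baseline, ovis) in scores.items() if ovis > baseline]
    losses = [name for name, (baseline, ovis) in scores.items() if ovis < baseline]
    return wins, losses


if __name__ == "__main__":
    wins, losses = compare(SCORES)
    print(f"Ovis2-1B is ahead on {len(wins)} of {len(SCORES)} benchmarks; behind on: {losses}")
```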

Video benchmarks

Benchmark	Qwen2.5-VL-3B	InternVL2.5-2B	InternVL2.5-1B	Ovis2-1B	Ovis2-2B
VideoMME (without/with subtitles)	61.5/67.6	51.9/54.1	50.3/52.3	48.6/49.5	57.2/60.8
MVBench	67.0	68.8	64.3	60.32	64.9
MLVU (M-Avg/G-Avg)	68.2/-	61.4/-	57.3/-	58.5/3.66	68.6/3.86
MMBench-Video	1.63	1.44	1.36	1.26	1.57
TempCompass	64.4	-	-	51.43	62.64

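📄 License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

📚 Citation

If you find Ovis useful, please consider citing the following paper: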

@article{lu2024ovis,
  title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
  author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
  year={2024},
  journal={arXiv:2405.20797}
}