---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- OpenGVLab/InternViT-6B-448px-V2_5
- Qwen/Qwen2.5-72B
base_model_relation: merge
language:
- multilingual
tags:
- internvl
- custom_code
---
InternVL3-78B-Pretrained
[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]
[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

Introduction
This is the pretrained version of InternVL3-78B, which has undergone native multimodal pre-training but has not been post-trained (i.e., no SFT or MPO). If you are not sure which version to use, please use the InternVL3-78B version.
We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits stronger multimodal perception and reasoning capabilities, and further extends its multimodal abilities to new domains such as tool usage, GUI agents, industrial image analysis, and 3D vision perception. Thanks to native multimodal pre-training, the InternVL3 series even surpasses the Qwen2.5 series in overall text performance, as shown by a comparison with the Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3.
InternVL3 Family
The following table provides an overview of the InternVL3 series:
Model Name | Vision Part | Language Part | HF Link |
---|---|---|---|
InternVL3-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B | 🤗 link |
InternVL3-2B | InternViT-300M-448px-V2_5 | Qwen2.5-1.5B | 🤗 link |
InternVL3-8B | InternViT-300M-448px-V2_5 | Qwen2.5-7B | 🤗 link |
InternVL3-9B | InternViT-300M-448px-V2_5 | internlm3-8b-instruct | 🤗 link |
InternVL3-14B | InternViT-300M-448px-V2_5 | Qwen2.5-14B | 🤗 link |
InternVL3-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B | 🤗 link |
InternVL3-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B | 🤗 link |
Model Architecture
As described in the InternVL3 report, the model follows the same "ViT-MLP-LLM" paradigm as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0. In the new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.
As in the previous versions, we apply a pixel unshuffle operation, reducing the number of visual tokens to one quarter of the original, and adopt the dynamic resolution strategy of InternVL 1.5, dividing images into tiles of 448×448 pixels. Starting from InternVL 2.0, the models additionally support multi-image and video data.
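For intuition, the pixel unshuffle step can be sketched as below: it folds each 2×2 neighborhood of ViT output features into the channel dimension, so an H×W grid of visual tokens becomes H/2×W/2, i.e. one quarter of the original count. This is an illustrative sketch rather than the exact implementation shipped with the model.
import torch

def pixel_unshuffle(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    # x: ViT features of shape (batch, height, width, channels); scale=0.5 merges each 2x2 patch group
    n, h, w, c = x.size()
    x = x.view(n, h, int(w * scale), int(c / scale))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(w * scale), int(h * scale), int(c / (scale * scale)))
    x = x.permute(0, 2, 1, 3).contiguous()
    return x  # (batch, h/2, w/2, 4*c): four times fewer visual tokens

features = torch.randn(1, 32, 32, 1024)   # e.g. a 32x32 grid of visual tokens
print(pixel_unshuffle(features).shape)    # torch.Size([1, 16, 16, 4096]) -> 256 tokens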
Notably, InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller and more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 exhibits better long-context understanding than its predecessors.
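To make the idea concrete, here is a minimal, purely illustrative sketch of variable position increments (the increment value delta below is an assumption, not the value used in training): text tokens advance the position index by 1, while visual tokens advance it by a smaller fraction, so long multimodal sequences consume far fewer position slots.
from typing import List

def v2pe_positions(token_types: List[str], delta: float = 0.25) -> List[float]:
    # token_types contains 'text' or 'image'; visual tokens use the smaller increment delta
    positions, pos = [], 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == 'text' else delta
    return positions

# 4 text tokens followed by 8 visual tokens span about 6 position slots instead of 12
print(v2pe_positions(['text'] * 4 + ['image'] * 8))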
Training Strategy
Native Multimodal Pre-Training
We propose a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. In contrast to the standard paradigm of first training a language-only model and subsequently adapting it to additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or interleaved image-text sequences) with large-scale textual corpora. This unified training scheme allows the model to learn linguistic and multimodal representations simultaneously, ultimately enhancing its ability to handle vision-language tasks without the need for separate alignment or bridging modules. Please see our paper for more details.
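As a loose illustration of the data-mixing idea (our own sketch; the sampling ratio and the two batch streams are assumptions), each training step draws a batch from either a text-only stream or a multimodal stream, so both kinds of representation are learned within the same pre-training stage:
import random
from itertools import cycle

def mixed_batch_sampler(text_batches, multimodal_batches, multimodal_ratio=0.5):
    # Interleave language-only and multimodal batches inside a single pre-training loop.
    text_stream, mm_stream = cycle(text_batches), cycle(multimodal_batches)
    while True:
        yield next(mm_stream if random.random() < multimodal_ratio else text_stream)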
Supervised Fine-Tuning
In this stage, the techniques of random JPEG compression, loss re-weighting (square averaging), and multimodal data packing proposed in InternVL2.5 are also employed. Compared with InternVL2.5, the main advance of InternVL3's SFT phase lies in the use of higher-quality and more diverse training data, with expanded samples for tool use, 3D scene understanding, GUI operation, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.
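As one concrete example, random JPEG compression can be sketched as follows (an illustrative version with an assumed quality range, not the exact augmentation pipeline used for InternVL): each image is re-encoded as JPEG at a randomly chosen quality to simulate the compression artifacts of real-world web images.
import io
import random
from PIL import Image

def random_jpeg_compression(image: Image.Image, quality_range=(75, 100)) -> Image.Image:
    # Re-encode the image as JPEG with a random quality factor to simulate compression artifacts.
    quality = random.randint(*quality_range)
    buffer = io.BytesIO()
    image.convert('RGB').save(buffer, format='JPEG', quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert('RGB')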
Mixed Preference Optimization
During pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens, whereas during inference it conditions on its own predicted outputs. This discrepancy between the ground-truth and model-predicted token distributions can impair chain-of-thought reasoning. MPO mitigates this by introducing additional supervision from both positive and negative samples, aligning the model's response distribution with the ground-truth distribution and thereby improving reasoning performance. Specifically, the training objective of MPO is a weighted combination of the preference loss \(\mathcal{L}_{\text{p}}\), the quality loss \(\mathcal{L}_{\text{q}}\), and the generation loss \(\mathcal{L}_{\text{g}}\):
$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$
where \(w_{*}\) denotes the weight assigned to each loss component. See the MPO paper for more details.
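The weighted combination itself is simple to express; a minimal sketch (the individual loss terms and the weights below are placeholders, not the official settings) is:
import torch

def mpo_objective(loss_p: torch.Tensor, loss_q: torch.Tensor, loss_g: torch.Tensor,
                  w_p: float = 1.0, w_q: float = 1.0, w_g: float = 1.0) -> torch.Tensor:
    # Weighted sum of the preference loss, quality loss, and generation loss.
    # Each term (e.g., a DPO-style preference loss and a standard LM loss) is computed separately.
    return w_p * loss_p + w_q * loss_q + w_g * loss_g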
Test-Time Scaling
Test-time scaling has been shown to be an effective method for enhancing the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ VisualPRM-8B as the critic model to select the best response for the reasoning and mathematics evaluations.
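Conceptually, Best-of-N selection works as in the sketch below, where `generate` stands in for the policy model's sampler and `score` for the VisualPRM critic (both callables are placeholders for illustration):
from typing import Callable, List

def best_of_n(prompt: str, generate: Callable[[str], str],
              score: Callable[[str, str], float], n: int = 8) -> str:
    # Sample N candidate responses and keep the one the critic scores highest.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))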
Evaluation on Multimodal Capability
Multimodal Reasoning and Mathematics
OCR, Chart, and Document Understanding
Multi-Image and Real-World Comprehension
Comprehensive Multimodal Evaluation and Hallucination
Visual Grounding
Multimodal Multilingual Understanding
Video Understanding
GUI Grounding
Spatial Reasoning
Evaluation on Language Capability
We compare InternVL3 with the Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Benefiting from native multimodal pre-training, the InternVL3 series surpasses the Qwen2.5 series in overall text performance. Note that the evaluation scores for Qwen2.5 may differ from those officially reported, because we adopt the prompt versions provided in OpenCompass for unified evaluation across all datasets.
Ablation Study
Native Multimodal Pre-Training
Experiments on InternVL2-8B show that simply replacing the conventional MLP warm-up stage with native multimodal pre-training allows the model to match the performance of the baseline trained with the full multi-stage pipeline on most benchmarks. Subsequent instruction tuning on higher-quality data yields further gains on multimodal tasks.
Mixed Preference Optimization
As shown in the table below, models fine-tuned with MPO outperform their counterparts without MPO on all seven multimodal reasoning benchmarks. Notably, the MPO training data is a subset of the SFT data, which indicates that the performance gains stem mainly from the training algorithm rather than from the data.
Variable Visual Position Encoding
Introducing V2PE leads to significant gains across most evaluation metrics. The ablation further shows that, even for tasks with conventional context lengths, relatively small position increments \( \delta \) achieve optimal performance, offering useful guidance for future refinements of position-encoding strategies for visual tokens in MLLMs.
Quick Start
We provide example code to run InternVL3-78B using transformers.
Please use transformers>=4.37.2 to ensure the model works normally.
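If you want to check the requirement programmatically, a small snippet of our own along these lines works:
from importlib.metadata import version
from packaging.version import Version

assert Version(version('transformers')) >= Version('4.37.2'), 'Please upgrade transformers to >= 4.37.2'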
Model Loading
16-bit (bf16 / fp16)
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-78B"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True).eval().cuda()
BNB 8-bit Quantization
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-78B"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
load_in_8bit=True,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True).eval()
Multiple GPUs
The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and the last layers of the large language model (LLM) are on the same device, such errors are prevented.
import math
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModel
def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for the ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
path = "OpenGVLab/InternVL3-78B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True,
device_map=device_map).eval()
Inference with Transformers
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the closest target aspect ratio
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image into tiles
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for the ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
path = 'OpenGVLab/InternVL3-78B'
device_map = split_model(path)
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
load_in_8bit=False,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True,
device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images?'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images?'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')
# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
Streaming Output
Besides this method, you can also use the following code to get streamed output:
from transformers import TextIteratorStreamer
from threading import Thread
# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()
# Initialize an empty string to store the generated text
generated_text = ''
# Loop over the streamer to get the newly generated text in real time
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line
Finetune
Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for details on fine-tuning.
Deployment
LMDeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs.
# if lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
pip install lmdeploy>=0.7.3
LMDeploy abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.
"Hello, world"示例
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL3-78B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=4), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)
If an ImportError occurs while running this case, please install the required dependency packages as prompted.
Multi-Image Inference
When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images lead to a higher number of input tokens, so the context window usually needs to be enlarged.
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN
model = 'OpenGVLab/InternVL3-78B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=4), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
image_urls=[
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
images = [load_image(img_url) for img_url in image_urls]
# Numbering the images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
Batch Prompts Inference
Conducting inference with batch prompts is quite straightforward; just place them in a list structure:
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL3-78B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=4), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
image_urls=[
"https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
"https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
Multi-Turn Conversation
There are two ways to conduct a multi-turn conversation with the pipeline: one is to construct messages according to the OpenAI format and use the method described above, the other is to use the pipeline.chat interface.
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL3-78B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=4), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
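For the first approach, a hedged sketch using OpenAI-format messages is shown below (the message schema follows LMDeploy's GPT-4V-style input; check the LMDeploy documentation if your version differs). History is carried forward by appending the assistant reply and the next user turn to the message list:
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

pipe = pipeline('OpenGVLab/InternVL3-78B',
                backend_config=TurbomindEngineConfig(session_len=16384, tp=4),
                chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
messages = [dict(role='user', content=[
    dict(type='text', text='describe this image'),
    dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg'))
])]
response = pipe(messages)
print(response.text)
# append the assistant reply and the next question to continue the conversation
messages.append(dict(role='assistant', content=response.text))
messages.append(dict(role='user', content='What is the woman doing?'))
response = pipe(messages)
print(response.text)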
Service
LMDeploy's api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of a service launch:
lmdeploy serve api_server OpenGVLab/InternVL3-78B --server-port 23333 --tp 4
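Once the server is running, the OpenAI-compatible endpoint can be queried with the official openai client. The sketch below is illustrative: it assumes the port from the launch command above and reuses an image URL from the earlier examples.
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)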