InternVL3-9B-AWQ: An Open-Source Multimodal Large Language Model for Multi-Scenario Applications with Strong Perception and Reasoning

Internvl3 9B AWQ

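Developed by OpenGVLab

InternVL3-9B is a multimodal large language model in the InternVL3 series. It offers strong multimodal perception and reasoning and supports applications such as tool use, GUI agents, industrial image analysis, and 3D vision perception.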

Image-Text-to-Text

Transformers

License: MIT #Multimodal Reasoning #Native Multimodal Pre-training #Long-Context Understanding

Downloads: 214

Release date: 4/17/2025

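Model Overview

InternVL3-9B uses a 'ViT-MLP-LLM' architecture that combines the InternViT vision encoder with the InternLM3 language model. Native multimodal pre-training gives it strong multimodal understanding and generation capabilities.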

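Model Features

Native Multimodal Pre-training

A unified training scheme learns language and multimodal representations at the same time, with no separate alignment or bridging module.

Variable Visual Position Encoding (V2PE)

Improves long-context understanding.

Mixed Preference Optimization (MPO)

Improves reasoning performance through supervision from positive and negative samples.

Multimodal Extension Capabilities

Supports diverse applications such as tool use, GUI operation, and 3D vision perception.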

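Model Capabilities

Multimodal reasoning

Mathematical computation

OCR

Chart understanding

Document analysis

Multi-image understanding

Video understanding

GUI grounding

Spatial reasoning

Multilingual understanding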

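Use Cases

Industrial Applications

Industrial image analysis

Defect detection and quality control in industrial settings.

Interactive Applications

GUI agent

Automated GUI operation and interface understanding.

3D Applications

3D scene understanding

Understanding and analyzing 3D scene information.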

🚀 InternVL3-9B

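InternVL3-9B is an advanced multimodal large language model (MLLM). Compared with its predecessor, it shows clearly improved multimodal perception and reasoning, and it extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, and more. The model also performs well on text tasks, outperforming the Qwen2.5 series.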

🚀 Quick Start

We provide example code for running InternVL3-9B with transformers.

⚠️ Important

Use transformers>=4.37.2 to ensure the model works properly.
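The version requirement above can be checked before loading the model. The following is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum`, and `check_transformers` are hypothetical, and the parser deliberately ignores pre-release suffixes such as `.dev0`.

```python
from importlib import metadata


def parse_version(version):
    """Turn '4.37.2' (or '4.38.0.dev0') into a tuple of leading integers."""
    parts = []
    for piece in version.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum='4.37.2'):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def check_transformers(minimum='4.37.2'):
    """Return True/False for the installed transformers, or None if it is missing."""
    try:
        installed = metadata.version('transformers')
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(installed, minimum)


if __name__ == '__main__':
    print('transformers version OK:', check_transformers())
```

For production use, a dedicated version parser is likely a better choice than this simplified tuple comparison.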

Model Loading

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-9B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit Quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-9B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

Multi-GPU Setup

import math
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModel

def split_model(model_path):
    # Build a device map that spreads the LLM layers across all visible GPUs.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, projector, embeddings, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-9B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
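To see how `split_model` divides the language-model layers, the arithmetic can be reproduced without CUDA. This is a sketch of the same ceiling-based split; the helper name `layers_per_gpu` and the layer count of 48 are illustrative assumptions, not values read from the model config.

```python
import math


def layers_per_gpu(num_layers, world_size):
    """Mirror the layer-count arithmetic of split_model without touching CUDA."""
    # GPU 0 also hosts the ViT, so it is counted as half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)
    return counts


# Hypothetical example: 48 decoder layers across 2 and 4 GPUs.
print(layers_per_gpu(48, 2))  # [16, 32]
print(layers_per_gpu(48, 4))  # [7, 14, 14, 14]
```

Because of the rounding up, the per-GPU counts can add up to slightly more than the real number of layers (49 slots for 48 layers in the four-GPU case), so the last GPU may hold fewer layers than its nominal share.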

Inference with Transformers

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_path):
    # AutoConfig is imported here so that this helper is self-contained.
    from transformers import AutoConfig
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

# Setting `load_in_8bit=True` loads the weights with bitsandbytes 8-bit quantization,
# which lowers GPU memory use compared with `load_in_8bit=False`.
path = 'OpenGVLab/InternVL3-9B'
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
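The frame selection performed by `get_index` above can be illustrated on its own. This is a pure-Python sketch of the same arithmetic for the case where no time bound is given; the helper name `sample_frame_indices` and the 240-frame clip are assumptions made for the example.

```python
def sample_frame_indices(max_frame, num_segments, start_idx=0):
    """Pick one frame near the middle of each of `num_segments` equal spans."""
    seg_size = float(max_frame - start_idx) / num_segments
    return [
        # Offset by half a segment so each pick sits mid-span, then truncate.
        int(start_idx + seg_size / 2 + round(seg_size * idx))
        for idx in range(num_segments)
    ]


# Hypothetical 240-frame clip (last index 239), sampled into 8 frames.
print(sample_frame_indices(239, 8))
# [14, 44, 74, 104, 134, 163, 193, 223]
```

The sampled frames are spread evenly across the clip, so raising `num_segments` trades more visual tokens for finer temporal coverage.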

Streaming Output

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line
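The streaming snippet above relies on a producer/consumer pattern: generation runs in a worker thread and pushes text chunks, while the main thread iterates over them as they arrive. The sketch below shows that pattern with the standard library only. `ToyStreamer` and `fake_generate` are hypothetical stand-ins, not the transformers classes, and no model is loaded.

```python
import queue
import threading


class ToyStreamer:
    """Hypothetical stand-in for a text streamer: a blocking iterator over chunks."""

    _END = object()  # sentinel that marks the end of generation

    def __init__(self, timeout=10):
        self._queue = queue.Queue()
        self._timeout = timeout

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        # Blocks until the producer supplies a chunk or the timeout expires.
        item = self._queue.get(timeout=self._timeout)
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(streamer, chunks):
    """Plays the role of model.chat: pushes chunks, then signals completion."""
    for chunk in chunks:
        streamer.put(chunk)
    streamer.end()


def stream_text(chunks):
    streamer = ToyStreamer()
    worker = threading.Thread(target=fake_generate, args=(streamer, chunks))
    worker.start()
    pieces = [piece for piece in streamer]  # consume while the worker produces
    worker.join()
    return ''.join(pieces)


print(stream_text(['A red ', 'panda ', 'eats.']))
```

The `timeout` plays the same role as in the snippet above: if the producer stalls, the consumer raises instead of blocking forever.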

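✨ Key Features

Strong multimodal capabilities: Compared with InternVL 2.5, InternVL3 shows better multimodal perception and reasoning, and extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and more.
Strong text performance: Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series.
Flexible model architecture: Follows the "ViT-MLP-LLM" paradigm, integrating a newly incrementally pre-trained InternViT with various pre-trained LLMs.
Advanced training strategy: Uses native multimodal pre-training, supervised fine-tuning, mixed preference optimization, and test-time scaling to improve model performance.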

📚 Documentation

InternVL3 Family

The following table gives an overview of the InternVL3 series:

| Model Name | Vision Part | Language Part | HF Link |
| --- | --- | --- | --- |
| InternVL3-1B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-1B) |
| InternVL3-2B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-2B) |
| InternVL3-8B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-8B) |
| InternVL3-9B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-9B) |
| InternVL3-14B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-14B) |
| InternVL3-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-38B) |
| InternVL3-78B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-72B](https://huggingface.co/Qwen/Qwen2.5-72B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-78B) |

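Model Architecture

[InternVL3](https://internvl.github.io/blog/2025-04-11-InternVL-3/) retains the same model architecture as [InternVL 2.5](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/) and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, a newly incrementally pre-trained InternViT is integrated with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.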

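As in previous versions, a pixel unshuffle operation is applied, reducing the number of visual tokens to one quarter of the original. A dynamic resolution strategy similar to that of InternVL 1.5 is also adopted, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that support for multi-image and video data was added.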
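To make the tiling and token reduction concrete, the sketch below re-implements the grid selection from the `dynamic_preprocess` helper in the inference script in pure Python, then estimates the visual token count. The function names are hypothetical, and `patch_size=14` and `downsample=0.5` are assumed values not stated in this document; only the one-quarter reduction and the 448×448 tile size come from the text.

```python
def choose_tile_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) grid whose aspect ratio best matches the image."""
    aspect_ratio = width / height
    grids = sorted(
        {(i, j)
         for i in range(1, max_num + 1)
         for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda g: g[0] * g[1])
    best, best_diff = (1, 1), float('inf')
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size ** 2 * cols * rows:
            # On a tie, prefer the larger grid only if the image is big enough.
            best = (cols, rows)
    return best


def count_tiles(width, height, max_num=12, use_thumbnail=True):
    cols, rows = choose_tile_grid(width, height, max_num=max_num)
    tiles = cols * rows
    # A thumbnail of the whole image is appended when there is more than one tile.
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


def visual_tokens(num_tiles, image_size=448, patch_size=14, downsample=0.5):
    # Assumed ViT patch size and downsample ratio; pixel unshuffle keeps 1/4 of the tokens.
    per_tile = int((image_size // patch_size) ** 2 * downsample ** 2)
    return num_tiles * per_tile


print(choose_tile_grid(1920, 1080))            # (4, 2)
print(count_tiles(1920, 1080))                 # 9
print(visual_tokens(count_tiles(1920, 1080)))  # 2304
```

Under these assumptions a 1920×1080 image becomes eight 448×448 tiles plus one thumbnail, and lowering `max_num` is the direct way to cap the token budget per image.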

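Notably, InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 shows better long-context understanding than its predecessors.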
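The idea of smaller position increments for visual tokens can be sketched as follows. This is one plausible reading, assuming each text token advances the position by 1 and each visual token by a fraction `delta`; the function name, the token labels, and `delta=0.25` are illustrative, and the exact rule used by InternVL3 is described in the paper rather than here.

```python
def v2pe_positions(token_kinds, delta=0.25):
    """Assign positions: text tokens advance by 1, visual tokens by `delta`."""
    positions = []
    current = 0.0
    for index, kind in enumerate(token_kinds):
        if index > 0:
            current += 1.0 if kind == 'text' else delta
        positions.append(current)
    return positions


kinds = ['text', 'text', 'image', 'image', 'image', 'image', 'text']
print(v2pe_positions(kinds))
# [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 3.0]
```

With unit increments the same sequence would end at position 6; here it ends at 3, which is how many visual tokens can fit into a fixed positional range.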

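Training Strategy

Native Multimodal Pre-training

We propose a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. In contrast to the standard paradigm, which first trains a language-only model and then adapts it to handle additional modalities, this method interleaves multimodal data (e.g., image-text, video-text, or interleaved image-text sequences) with large-scale text corpora. This unified training scheme lets the model learn linguistic and multimodal representations simultaneously, ultimately improving its ability to handle vision-language tasks without separate alignment or bridging modules. Please see our paper for more details.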

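Supervised Fine-Tuning

In this stage, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also applied to the InternVL3 series. The main advance of the SFT stage in InternVL3 over InternVL2.5 lies in the use of higher-quality and more diverse training data. Specifically, we further extend the training samples for tool use, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.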

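Mixed Preference Optimization

During pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. During inference, however, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the training objective of MPO is a combination of preference loss $\mathcal{L}_{\text{p}}$, quality loss $\mathcal{L}_{\text{q}}$, and generation loss $\mathcal{L}_{\text{g}}$, which can be formulated as: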

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where $w_{*}$ denotes the weight assigned to each loss component. For more details about MPO, please refer to our paper.
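As a small worked instance of the weighted sum, the sketch below combines three already-computed loss values. The function name, the loss values, and the weights are all made up for illustration; the document does not state the weights used in training or how each component loss is computed.

```python
def mpo_loss(pref_loss, quality_loss, gen_loss, w_p, w_q, w_g):
    """Weighted combination of preference, quality, and generation losses."""
    return w_p * pref_loss + w_q * quality_loss + w_g * gen_loss


# Hypothetical per-batch loss values and weights.
total = mpo_loss(pref_loss=1.0, quality_loss=2.0, gen_loss=4.0,
                 w_p=0.5, w_q=0.25, w_g=0.25)
print(total)  # 2.0
```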

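Test-Time Scaling

Test-time scaling has been shown to be an effective way to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model to select the best response for reasoning and mathematics evaluation.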
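Best-of-N selection itself is simple: sample N candidate responses, score each with a critic, and keep the highest-scoring one. The sketch below shows only that selection step; `toy_score` is a hypothetical placeholder for a critic model and does not reflect how VisualPRM-8B scores responses.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest critic score (first one on ties)."""
    if not candidates:
        raise ValueError('need at least one candidate response')
    return max(candidates, key=score)


def toy_score(response: str) -> float:
    """Hypothetical critic: rewards responses that state a final answer."""
    return 1.0 if 'answer:' in response.lower() else 0.0


samples = ['I am not sure.', 'Answer: 42', 'Maybe 41?']
print(best_of_n(samples, toy_score))  # Answer: 42
```

In a real pipeline the candidates would come from repeated sampling with `do_sample=True`, and the critic call would dominate the added inference cost.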

Evaluation on Multimodal Capability

Multimodal Reasoning and Mathematics

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/reasoning.png)

OCR, Chart, and Document Understanding

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ocr.png)

Multi-Image and Real-World Understanding

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/multi-images.png)

Comprehensive Multimodal and Hallucination Evaluation

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/comprehensive.png)

Visual Grounding

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/grounding.png)

Multimodal Multilingual Understanding

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/multilingual.png)

Video Understanding

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/video.png)

GUI Grounding

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/gui.png)

Spatial Reasoning

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/vsi.png)

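Evaluation on Language Capability

We compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series.

Note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, because we used the prompt versions provided in the table across all datasets for OpenCompass evaluation.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/text.png)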

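Ablation Study

Native Multimodal Pre-training

We conducted experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B uses a training pipeline that begins with an MLP warmup stage for feature alignment, followed by an instruction tuning stage. In our experiments, we replaced the conventional MLP warmup stage with a native multimodal pre-training process. This change isolates the contribution of native multimodal pre-training to the model's overall multimodal capability.

The evaluation results in the figure below show that the model with native multimodal pre-training performs comparably to the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, when followed by instruction tuning on higher-quality data, the model shows further gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in giving MLLMs strong multimodal capabilities.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ablation-native.png)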

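Mixed Preference Optimization

As shown in the table below, models fine-tuned with MPO show better reasoning performance across seven multimodal reasoning benchmarks than their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the gains come primarily from the training algorithm rather than the training data.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ablation-mpo.png)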

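Variable Visual Position Encoding

As reported in the table below, introducing V2PE leads to significant performance gains on most evaluation metrics. In addition, ablation studies that vary the positional increment $\delta$ show that, even for tasks that mainly involve conventional contexts, relatively small $\delta$ values achieve the best performance. These findings provide useful insights for future work on position encoding strategies for visual tokens in MLLMs.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ablation-v2pe.png)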

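🔧 Technical Details

Training Strategy

Native multimodal pre-training: Consolidates language and vision learning into a single pre-training stage, interleaving multimodal data with large-scale text corpora so that the model learns linguistic and multimodal representations simultaneously.
Supervised fine-tuning: Applies random JPEG compression, square loss re-weighting, and multimodal data packing, with higher-quality and more diverse training data.
Mixed preference optimization: Introduces additional supervision from positive and negative samples to align the model response distribution with the ground-truth distribution, improving reasoning performance.
Test-time scaling: Uses the Best-of-N evaluation strategy with [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model to select the best response.

Model Architecture Improvements

V2PE integration: InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments to improve long-context understanding.
Multi-image and video support: Support for multi-image and video data has been included since InternVL 2.0.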

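📦 Installation

LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs and VLMs.

# If lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
# The quotes keep the shell from treating ">=" as an output redirection.
pip install "lmdeploy>=0.7.3"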

💻 Usage Examples

LMDeploy Examples

"Hello, world" Example

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-9B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)

Multi-Image Inference

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL3-9B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering the images helps in multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

Batch Prompt Inference

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-9B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

Multi-Turn Conversation

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-9B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

Service Deployment

lmdeploy serve api_server OpenGVLab/InternVL3-9B --chat-template internvl2_5 --server-port 23333 --tp 1

To use the OpenAI-style interface, install the OpenAI package:

pip install openai

Then make the API call with the following code:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
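The nested `messages` structure in the request above is easy to get wrong by hand. A small helper can build it from a prompt and an image URL; this is a sketch, and the name `build_image_message` is hypothetical rather than part of the OpenAI client or LMDeploy.

```python
def build_image_message(text, image_url):
    """Build an OpenAI-style chat `messages` list with one text part and one image part."""
    return [{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': text},
            {'type': 'image_url', 'image_url': {'url': image_url}},
        ],
    }]


messages = build_image_message(
    'describe this image',
    'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
)
print(messages[0]['content'][1]['image_url']['url'])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create` in the example above.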

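📄 License

This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License.

Citation

If you find this project useful in your research, please consider citing: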

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}