InternVL3-38B Open-Source Multimodal Large Language Model - Strong Perception and Reasoning, Extended Multimodal Capabilities

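InternVL3-38B

Developed by FriendliAI

InternVL3-38B is an advanced multimodal large language model (MLLM) with strong multimodal perception and reasoning. It improves markedly on its predecessor and extends its multimodal capabilities to tool use, GUI agents, and more.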

Image-text-to-text

Transformers

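License: Other #Multimodal reasoning #Tool-use agent #Dynamic-resolution processing

Downloads: 166

Release date: 4/12/2025

Model Overview

InternVL3-38B is a multimodal large language model with strong multimodal perception and reasoning capabilities. It supports application scenarios such as tool use and GUI agents.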

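Model Features

Advanced multimodal capabilities

Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and it extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and other domains.

Strong language performance

Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 Chat models.

Flexible model architecture

The model follows the "ViT-MLP-LLM" paradigm, integrating a newly incrementally pre-trained InternViT with various pre-trained LLMs, such as InternLM 3 and Qwen 2.5.

Efficient training strategy

The training recipe introduces native multimodal pre-training, which consolidates language and vision learning into a single pre-training stage; uses high-quality, diverse data during supervised fine-tuning (SFT); and applies Mixed Preference Optimization (MPO) to improve reasoning performance.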

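Model Capabilities

Multimodal perception

Multimodal reasoning

Tool use

GUI agent

Industrial image analysis

3D vision perception

Text generation

Image analysis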

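Use Cases

Multimodal reasoning

Multimodal reasoning tasks

Performs well on multiple multimodal reasoning benchmarks.

With MPO fine-tuning, InternVL3-38B scores 4.5 points higher than its counterpart trained without MPO.

GUI operation

GUI agent

Supports GUI operation tasks.

Industrial image analysis

Supports industrial image analysis tasks.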

🚀 InternVL3-38B

[GitHub] [InternVL 1.0] [InternVL 1.5] [InternVL 2.5] [InternVL2.5-MPO] [InternVL3]

[Blog] [Chat Demo] [HF Demo] [Quick Start] [Documents]

🚀 Quick Start

We provide example code for running InternVL3-38B with transformers.

⚠️ Important

Please use transformers>=4.37.2 to ensure the model works properly.

Model Loading

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-38B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-38B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

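Multiple GPUs

The code below is written this way to avoid errors during multi-GPU inference caused by tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.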

import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    # Read the number of LLM layers from the model config.
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for _ in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, projector, embeddings, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # Pin the last LLM layer to GPU 0 so it shares a device with the first layer.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-38B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
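To make the layer budget in split_model easier to follow, here is a minimal pure-Python sketch of the same arithmetic that needs no GPU or model download. The names layers_per_gpu and assign_layers are hypothetical helpers, and the values of 64 layers and 4 GPUs are illustrative assumptions rather than numbers read from the real config. Unlike split_model, this sketch stops assigning once it runs out of layers.

```python
import math


def layers_per_gpu(num_layers: int, world_size: int) -> list[int]:
    """Mirror the per-GPU layer budget, counting GPU 0 as half a GPU."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    budget = [per_gpu] * world_size
    # GPU 0 also hosts the vision encoder, so it gets half the layer budget.
    budget[0] = math.ceil(budget[0] * 0.5)
    return budget


def assign_layers(num_layers: int, world_size: int) -> dict[int, int]:
    """Map each LLM layer index to a GPU index, then pin the last layer to GPU 0."""
    mapping: dict[int, int] = {}
    layer = 0
    for gpu, count in enumerate(layers_per_gpu(num_layers, world_size)):
        for _ in range(count):
            if layer >= num_layers:
                break  # the budget can exceed the real layer count
            mapping[layer] = gpu
            layer += 1
    # First and last layers share a device to avoid cross-device tensor errors.
    mapping[num_layers - 1] = 0
    return mapping


# Illustrative values only: 64 layers across 4 GPUs.
budget = layers_per_gpu(64, 4)
mapping = assign_layers(64, 4)
print(budget)
print(sum(1 for gpu in mapping.values() if gpu == 0))
```

Under these assumed values GPU 0 receives a smaller share of LLM layers than the other GPUs, plus the final layer.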

Inference with Transformers

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

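def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for _ in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map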

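📦 Model Information

Attribute	Details
Model type	Multimodal large language model
Base models	OpenGVLab/InternViT-6B-448px-V2_5, Qwen/Qwen2.5-32B
Base model relation	Merge
Training dataset	OpenGVLab/MMPR-v1.2
Supported languages	Multilingual
Tags	internvl, custom_code
License	qwen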

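📚 Detailed Documentation

Model Architecture

As shown in the figure below, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.

Model architecture diagram

As in previous versions, we apply a pixel unshuffle operation that reduces the number of visual tokens to one quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally support multi-image and video data.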
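The effect of tiling and pixel unshuffle on sequence length can be worked out with simple arithmetic. The sketch below assumes a ViT patch size of 14 pixels and a 2×2 unshuffle factor; the patch size is not stated in this document, so treat it as an assumption. The thumbnail rule follows the dynamic_preprocess code in the quick-start section, which appends one extra tile when an image is split into more than one tile.

```python
def visual_tokens_per_tile(tile_size: int = 448,
                           patch_size: int = 14,
                           unshuffle_factor: int = 2) -> int:
    """Tokens left for one tile after patchification and pixel unshuffle."""
    patches_per_side = tile_size // patch_size  # assumed patch size
    patches = patches_per_side ** 2
    # Pixel unshuffle folds each factor x factor neighborhood into one token,
    # so a factor of 2 leaves one quarter of the tokens.
    return patches // (unshuffle_factor ** 2)


def visual_tokens_for_image(num_tiles: int, use_thumbnail: bool = True) -> int:
    """Total visual tokens for an image split into num_tiles tiles."""
    extra = 1 if use_thumbnail and num_tiles > 1 else 0
    return (num_tiles + extra) * visual_tokens_per_tile()


print(visual_tokens_per_tile())
print(visual_tokens_for_image(12))
```

Under these assumptions the token count grows linearly with the number of tiles, which is why max_num in the preprocessing code acts as a budget on sequence length.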

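Notably, InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 shows better long-context understanding than its predecessors.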
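The idea of smaller position increments for visual tokens can be illustrated with a short sketch. It assumes that each text token advances the position index by 1 and each visual token advances it by a fraction delta; the function name v2pe_positions and the value delta=0.25 are illustrative assumptions, not the values used in training.

```python
def v2pe_positions(token_types: str, delta: float = 0.25) -> list[float]:
    """Assign position indices to a sequence of 't' (text) and 'v' (visual) tokens.

    The first token sits at position 0. Every later token moves the index
    forward by 1 if it is a text token, or by delta if it is a visual token.
    """
    positions: list[float] = []
    current = 0.0
    for index, kind in enumerate(token_types):
        if kind not in ("t", "v"):
            raise ValueError(f"unknown token type: {kind!r}")
        if index > 0:
            current += 1.0 if kind == "t" else delta
        positions.append(current)
    return positions


# Two text tokens, four visual tokens, one text token.
print(v2pe_positions("ttvvvvt"))
# With delta=1.0 the scheme reduces to ordinary sequential positions.
print(v2pe_positions("ttvvvvt", delta=1.0))
```

Because the visual tokens consume less of the position range, a long run of image tiles or video frames stays within a smaller span of position indices than it would with unit increments.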

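Training Strategy

Native Multimodal Pre-Training

We propose a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. In contrast to the standard paradigm, which first trains a language-only model and then adapts it to handle additional modalities, our method interleaves multimodal data (such as image-text, video-text, or interleaved image-text sequences) with large-scale text corpora. This unified training scheme lets the model learn linguistic and multimodal representations at the same time, ultimately improving its ability to handle vision-language tasks without separate alignment or bridging modules. See our paper for more details.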
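As a rough illustration of mixing text-only and multimodal samples into one training stream, the sketch below draws from two sources with a fixed probability. The function interleave_corpora, the text_ratio value, and the sample names are all hypothetical; the actual sampling ratios and data pipeline are not described in this document.

```python
import random
from typing import Sequence


def interleave_corpora(text_samples: Sequence[str],
                       multimodal_samples: Sequence[str],
                       text_ratio: float = 0.25,
                       seed: int = 0) -> list[tuple[str, str]]:
    """Merge two corpora into one stream, keeping each corpus in its own order."""
    rng = random.Random(seed)
    t, m = 0, 0
    mixed: list[tuple[str, str]] = []
    while t < len(text_samples) or m < len(multimodal_samples):
        text_left = t < len(text_samples)
        mm_left = m < len(multimodal_samples)
        # Draw text with probability text_ratio while both sources have data;
        # once one source is exhausted, drain the other.
        pick_text = text_left and (not mm_left or rng.random() < text_ratio)
        if pick_text:
            mixed.append(("text", text_samples[t]))
            t += 1
        else:
            mixed.append(("multimodal", multimodal_samples[m]))
            m += 1
    return mixed


stream = interleave_corpora(["doc-a", "doc-b"], ["img-1", "img-2", "vid-1"])
for source, sample in stream:
    print(source, sample)
```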

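Supervised Fine-Tuning

In this stage, the techniques proposed in InternVL2.5 (random JPEG compression, square loss re-weighting, and multimodal data packing) are also applied to the InternVL3 series. The main advance of InternVL3's SFT stage over InternVL2.5 is the use of higher-quality and more diverse training data. Specifically, we further extend the training samples for tool use, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.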
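One way to read square loss re-weighting is as a middle ground between averaging the loss per token and averaging it per sample. The sketch below assumes each token in a response of x tokens is weighted by 1/x^0.5; that formula, the normalization, and the function names are assumptions made for illustration and are not taken from this document.

```python
def token_weight(num_response_tokens: int, exponent: float = 0.5) -> float:
    """Weight applied to every token of a response with the given length."""
    return 1.0 / (num_response_tokens ** exponent)


def reweighted_loss(per_sample_token_losses: list[list[float]],
                    exponent: float = 0.5) -> float:
    """Weighted mean of token losses across a batch of responses.

    exponent=0 gives plain token averaging, exponent=1 gives sample averaging,
    and exponent=0.5 sits between the two.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for token_losses in per_sample_token_losses:
        if not token_losses:
            continue  # skip empty responses
        weight = token_weight(len(token_losses), exponent)
        weighted_sum += weight * sum(token_losses)
        weight_total += weight * len(token_losses)
    if weight_total == 0.0:
        raise ValueError("no tokens to average over")
    return weighted_sum / weight_total


batch = [[1.0, 1.0, 1.0, 1.0], [3.0]]  # one long response, one short response
print(reweighted_loss(batch, exponent=0.0))
print(reweighted_loss(batch, exponent=0.5))
print(reweighted_loss(batch, exponent=1.0))
```

In this toy batch the short response counts for more as the exponent grows, so the choice of exponent controls how much long responses dominate the gradient.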

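Mixed Preference Optimization

During pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. During inference, however, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model's response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the MPO training objective combines a preference loss $\mathcal{L}_{\text{p}}$, a quality loss $\mathcal{L}_{\text{q}}$, and a generation loss $\mathcal{L}_{\text{g}}$, and can be written as: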

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}} $$

where $w_{*}$ denotes the weight assigned to each loss component. See our paper for more details about MPO.
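The MPO objective is a weighted sum of three scalar losses, which can be sketched in a few lines. The class MPOWeights, the function mpo_loss, and every numeric value below are illustrative assumptions; the actual weights and the definitions of the individual loss terms are given in the MPO paper, not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MPOWeights:
    """Weights w_p, w_q, w_g for the three loss components."""
    preference: float
    quality: float
    generation: float


def mpo_loss(preference_loss: float,
             quality_loss: float,
             generation_loss: float,
             weights: MPOWeights) -> float:
    """Combine the three component losses: L = w_p*L_p + w_q*L_q + w_g*L_g."""
    return (weights.preference * preference_loss
            + weights.quality * quality_loss
            + weights.generation * generation_loss)


# Hypothetical weights and per-batch loss values, chosen only for illustration.
weights = MPOWeights(preference=0.5, quality=0.25, generation=1.0)
total = mpo_loss(preference_loss=2.0, quality_loss=1.0, generation_loss=0.5,
                 weights=weights)
print(total)
```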

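Test-Time Scaling

Test-time scaling has been shown to be an effective way to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model to select the best response for reasoning and mathematics evaluation.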
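Best-of-N selection itself is simple: sample N candidate responses, score each with a critic, and keep the highest-scoring one. The sketch below uses a toy lookup table as the critic; in practice the score would come from a model such as VisualPRM-8B, whose API is not shown here. The function best_of_n and the candidate strings are hypothetical.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str],
              score: Callable[[str], float]) -> str:
    """Return the candidate with the highest critic score.

    Ties resolve to the earliest candidate, following the behavior of max().
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=score)


# Toy critic: a fixed table standing in for a learned reward model.
toy_scores = {
    "answer: 12": 0.20,
    "answer: 15": 0.90,
    "answer: 17": 0.55,
}
responses = list(toy_scores)
chosen = best_of_n(responses, score=lambda response: toy_scores[response])
print(chosen)
```

The cost of this strategy scales with N, since every candidate must be both generated and scored.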

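🔧 Technical Details

Native Multimodal Pre-Training

We conduct experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B uses a training pipeline that begins with an MLP warmup stage for feature alignment, followed by an instruction tuning stage. In our experiments, we replace the conventional MLP warmup stage with native multimodal pre-training. This modification isolates the contribution of native multimodal pre-training to the model's overall multimodal capability.

The evaluation results in the figure below show that the model with native multimodal pre-training performs on par with the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, after instruction tuning on higher-quality data, the model shows further gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in giving MLLMs strong multimodal capabilities.

Evaluation results for native multimodal pre-training

Mixed Preference Optimization

As shown in the table below, models fine-tuned with MPO achieve better reasoning performance across seven multimodal reasoning benchmarks than their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of the data used for SFT, which indicates that the performance gains stem primarily from the training algorithm rather than the training data.

Evaluation results for Mixed Preference Optimization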