---
base_model:
- Dream-org/Dream-v0-Instruct-7B
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- lmms-lab/LLaVA-NeXT-Data
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
tags:
- Diffusion Multi-modal Large Language Model
- MLLM
- Discrete Diffusion
---
🤗 Model   |   💬 Demo: Chat with Dimple   |   📑 Paper   |   ✨ Code
💧 Dimple-7B
Dimple is the first Discrete Diffusion Multimodal Large Language Model (DMLLM) to combine autoregressive and diffusion training paradigms. Its architecture is similar to Qwen and LLaVA, but it is trained with an autoregressive-then-diffusion strategy:
- Stage 1: alignment and initial instruction tuning via autoregressive fine-tuning
- Stage 2: diffusion-based fine-tuning to strengthen instruction following
Trained on the same dataset as LLaVA-NEXT, Dimple-7B outperforms LLaVA-NEXT-7B by 3.9%, demonstrating that a diffusion-based multimodal LLM can match its autoregressive counterpart at a similar training cost.
🔍 Highlights
- Hybrid training: combines autoregressive and diffusion training paradigms
- Diffusion decoding: supports confident decoding, random decoding, maskgit-style decoding, and entropy-based decoding (see the sketch after this list)
- Controllable generation: fine-grained control over format, structure, and length via structure priors
- Autoregressive-like prefilling: uses prefilling techniques to speed up inference
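These decoders differ mainly in how they score masked positions and decide which ones to commit at each denoising step. As a rough illustration of the idea behind confident decoding only (a minimal sketch, not Dimple's actual implementation; `logits_fn`, `mask_id`, and the 0.9 threshold are hypothetical stand-ins):

```python
import torch

def confident_decode(logits_fn, tokens, mask_id, threshold=0.9, max_steps=64):
    """Illustrative confident decoding for a 1-D token tensor.

    At each step, commit every masked position whose top-1 probability
    exceeds `threshold`; if none qualify, commit the single most confident
    position so the loop always makes progress. `logits_fn(tokens)` is a
    hypothetical stand-in for one forward pass returning (seq_len, vocab).
    """
    for _ in range(max_steps):
        masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break  # every position has been committed
        probs = torch.softmax(logits_fn(tokens), dim=-1)
        conf, pred = probs[masked].max(dim=-1)
        ready = conf >= threshold
        if not ready.any():
            ready = conf == conf.max()  # commit at least one token per step
        tokens[masked[ready]] = pred[ready]
    return tokens
```

The maskgit-style and entropy-based variants keep the same commit loop but swap the top-1-probability score for a schedule-driven commit count or an entropy-based score, respectively.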
📊 Evaluation Results
| Benchmark | Dimple-7B (this model) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
|---|---|---|---|---|---|---|---|
| Training Samples | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| Training Tokens | 0.8B | - | - | - | - | - | 2.6T |
| Base LLM | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| GQA | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| MMBench (en test) | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| MME (Perception) | 1514 | 1510 | 1519 | 1528 | - | - | - |
| MME (Cognition) | 432 | - | 332 | - | - | - | - |
| MME (Total) | 1946 | - | 1851 | - | - | - | 2347 |
| POPE | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| MMMU (val) | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| SQA (image) | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| AI2D | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| ChartQA | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| TextVQA | 61.6 | - | 64.8 | - | 83.0 | - | - |
| OCRBench | 565 | - | 490 | 529 | - | - | - |
| MathVista (mini) | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| MMVet | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |
🛠️ Environment Setup
Make sure your environment includes the following versions:
```
transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
```
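A quick sanity check that the pinned versions are the ones actually imported (a convenience snippet, not part of the original card):

```python
import accelerate, torch, transformers

# The remote code in this repo was tested against these exact versions.
assert transformers.__version__ == "4.46.2", transformers.__version__
assert torch.__version__.startswith("2.5.1"), torch.__version__  # e.g. "2.5.1+cu124"
assert accelerate.__version__ == "1.6.0", accelerate.__version__
```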
🚀 Inference Example
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "rp-yu/Dimple-7B"

# Load the processor and the model; trust_remote_code pulls in Dimple's
# custom diffusion generation code.
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# A single-turn multimodal conversation (one image + one text prompt).
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]
inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)
input_ids = inputs.pop("input_ids")

# Diffusion-based generation: `steps` is the number of denoising iterations
# and `alg` selects the decoding algorithm.
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs
)

# Strip the prompt tokens and decode only the newly generated part,
# truncating at the first EOS token.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]
for j in range(len(messages)):
    print("Output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
```
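In the call above, `steps` equals `max_new_tokens`, so decoding proceeds at roughly one token per iteration. The parallel-decoding advantage of the diffusion formulation comes from setting `steps` below `max_new_tokens`, so that several tokens are committed per iteration; the speed/quality trade-off is model- and task-dependent. A variant under that assumption, reusing the inputs prepared above:

```python
# Assumption: fewer denoising steps than new tokens makes the decoder
# commit about two tokens per iteration on average. All other arguments
# mirror the example above.
fast_output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    steps=32,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    return_dict_in_generate=True,
    **inputs
)
```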
📚 Citation
```bibtex
@misc{dimple,
    title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding},
    author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
    year={2025},
    eprint={2505.16990},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.16990},
}
```