---
license: apache-2.0
datasets:
- VARGPT-family/VARGPT_datasets
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: any-to-any
library_name: transformers
---
# VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

VARGPT-v1.1 (7B+2B) models the two paradigms in one unified model: visual understanding through next-token prediction and visual generation through next-scale prediction.

We provide a simple generation pipeline below; for more details, please visit our GitHub repository: VARGPT-v1.1.
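To illustrate the contrast between the two decoding regimes, here is a toy sketch (not the real VARGPT implementation; the dummy `logits_fn` callables and scale schedule are stand-ins): next-token prediction appends one token per step, while next-scale prediction emits an entire coarse-to-fine token map per step.

```python
import torch

def next_token_decode(logits_fn, prompt, steps):
    """Understanding: append one token per step (next-token prediction)."""
    seq = list(prompt)
    for _ in range(steps):
        logits = logits_fn(seq)                # [vocab]
        seq.append(int(torch.argmax(logits)))  # greedy pick of the next token
    return seq

def next_scale_decode(logits_fn, scales):
    """Generation: predict a whole token map per step, coarse to fine
    (next-scale prediction); each step conditions on all previous scales."""
    maps = []
    for h, w in scales:                        # e.g. 1x1 -> 2x2 -> 4x4
        logits = logits_fn(maps, (h, w))       # [h*w, vocab]
        maps.append(torch.argmax(logits, dim=-1).reshape(h, w))
    return maps

# Dummy "models" (random logits) so the sketch runs end to end.
vocab = 16
tok = next_token_decode(lambda s: torch.randn(vocab), prompt=[1, 2], steps=4)
maps = next_scale_decode(lambda m, hw: torch.randn(hw[0] * hw[1], vocab),
                         scales=[(1, 1), (2, 2), (4, 4)])
print(len(tok), [tuple(m.shape) for m in maps])
```

Note how the next-scale loop grows the output quadratically per step, which is why generation needs far fewer autoregressive steps than decoding an image token by token.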
## Multimodal Understanding

To run an inference demo for multimodal understanding, execute the following code:
```python
from PIL import Image
import torch
from transformers import AutoTokenizer
from vargpt_qwen_v1_1.modeling_vargpt_qwen2_vl import VARGPTQwen2VLForConditionalGeneration
from vargpt_qwen_v1_1.prepare_vargpt_v1_1 import prepare_vargpt_qwen2vl_v1_1
from vargpt_qwen_v1_1.processing_vargpt_qwen2_vl import VARGPTQwen2VLProcessor
from patching_utils.patching import patching

model_id = "VARGPT-family/VARGPT-v1.1"
prepare_vargpt_qwen2vl_v1_1(model_id)

model = VARGPTQwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to(0)
patching(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTQwen2VLProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please explain the meme in detail."},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

image_file = "./assets/llava_bench_demo.png"
raw_image = Image.open(image_file)
inputs = processor(images=[raw_image], text=prompt, return_tensors="pt").to(0, torch.float32)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
```
## Multimodal Generation

To run an inference demo for text-to-image generation, execute the following code:
```python
import torch
from transformers import AutoTokenizer
from vargpt_qwen_v1_1.modeling_vargpt_qwen2_vl import VARGPTQwen2VLForConditionalGeneration
from vargpt_qwen_v1_1.prepare_vargpt_v1_1 import prepare_vargpt_qwen2vl_v1_1
from vargpt_qwen_v1_1.processing_vargpt_qwen2_vl import VARGPTQwen2VLProcessor
from patching_utils.patching import patching

model_id = "VARGPT-family/VARGPT-v1.1"
prepare_vargpt_qwen2vl_v1_1(model_id)

model = VARGPTQwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to(0)
patching(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTQwen2VLProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please design a power metal album cover: fantasy artwork featuring a white falcon."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

inputs = processor(text=prompt, return_tensors="pt").to(0, torch.float32)

# The generated image is written to this path during generation.
model._IMAGE_GEN_PATH = "output.png"
output = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=False,
)
print(processor.decode(output[0][:-1], skip_special_tokens=True))
```
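The generation demo writes the image to the path assigned to `model._IMAGE_GEN_PATH` rather than returning it, so retrieving the result is just a matter of loading that file back with PIL. A stand-in file is created here so the sketch runs without the model:

```python
from PIL import Image

# Stand-in for a real model run: the demo above writes its result to
# "output.png" via model._IMAGE_GEN_PATH.
Image.new("RGB", (256, 256), "white").save("output.png")

# Load the generated image back for inspection or display.
img = Image.open("output.png")
print(img.size, img.mode)  # prints: (256, 256) RGB
```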
## Citation

If you use this dataset or model, please cite the following papers.

VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning:
```bibtex
@misc{zhuang2025vargptunifiedunderstandinggeneration,
      title={VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model},
      author={Xianwei Zhuang and Yuxin Xie and Yufan Deng and Liming Liang and Jinghan Ru and Yuguo Yin and Yuexian Zou},
      year={2025},
      eprint={2501.12327},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.12327},
}

@misc{zhuang2025vargptv11improvevisualautoregressive,
      title={VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning},
      author={Xianwei Zhuang and Yuxin Xie and Yufan Deng and Dongchao Yang and Liming Liang and Jinghan Ru and Yuguo Yin and Yuexian Zou},
      year={2025},
      eprint={2504.02949},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.02949},
}
```