VisualThinker-R1-Zero: An Open-Source Multimodal Reasoning Model That Reproduces the "Aha Moment" and Increased Response Length

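Developed by turningpoint-ai

The first multimodal reasoning model to reproduce the "aha moment" and increased response length using only a non-SFT 2B model

Image-to-Text

Safetensors

Language: English · License: MIT · Tags: #multimodal-reasoning #reinforcement-learning-optimization #vision-centric-tasks

Downloads: 578

Released: 2/28/2025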

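Model Overview

Built on the Qwen2-VL-2B base model and trained with reinforcement learning on the SAT dataset, which improves its reasoning on vision-centric tasks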

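Model Features

Aha-moment reproduction

The first to reproduce the "aha moment" behavior of DeepSeek-R1 on a non-SFT 2B model

Vision-centric reasoning

Shows that vision-centric tasks can also benefit from improved reasoning capabilities

Self-reflection

The model shows an emergent ability to rethink and correct its own mistakes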

Model Capabilities

Multimodal reasoning

Image understanding

Text generation

Vision-centric task handling

Use Cases

Visual reasoning

Object position analysis

Analyzes the relative positions of objects in an image

Reaches 59.47% accuracy on CVBench

🚀 VisualThinker-R1-Zero

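VisualThinker-R1-Zero is a project focused on multimodal reasoning. Starting from a non-SFT 2B model, it is the first to reproduce the "aha moment" and increased response length in multimodal reasoning. The model reaches 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and the SFT setting by about 2%.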
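
For reference, accuracy on a multiple-choice benchmark is the share of questions whose predicted option matches the gold option. The sketch below is a minimal illustration, assuming predictions and labels have already been reduced to option letters. The function name and the sample data are made up for this example and are not taken from the project's evaluation code.

```python
def multiple_choice_accuracy(predictions, labels):
    """Return the fraction of predictions that match the gold labels.

    Both arguments are sequences of option letters such as "A" or "(b)".
    Comparison ignores case, surrounding whitespace, and parentheses.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")

    def normalize(option):
        return option.strip().strip("()").upper()

    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, labels))
    return correct / len(labels)


# Illustrative data only: 3 of the 4 answers match.
print(multiple_choice_accuracy(["A", "(b)", "A", "B"], ["A", "B", "B", "B"]))  # 0.75
```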

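🚀 Quick Start

This project builds on the Qwen2-VL-2B model and applies reinforcement learning directly on the SAT dataset to improve multimodal reasoning. The project code is available on GitHub.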
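
This page does not describe the reward used during training. As a rough illustration of what a rule-based reward in R1-Zero-style reinforcement learning can look like, the sketch below combines a format check with an answer check. The tag layout, the reward values, and the function name are all assumptions made for this example; they are not the project's training code.

```python
import re

# Assumed response layout: reasoning inside <think>...</think>,
# followed by the final choice inside <answer>...</answer>.
RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


def rule_based_reward(response, gold_choice):
    """Score a response with a format reward plus an accuracy reward.

    Hypothetical scoring: 1.0 for following the assumed tag layout,
    plus 1.0 if the answer is exactly the gold option letter,
    optionally wrapped in parentheses, e.g. "B" or "(B)".
    """
    match = RESPONSE_PATTERN.match(response)
    if match is None:
        return 0.0  # malformed output earns nothing

    reward = 1.0  # format reward
    choice = re.fullmatch(r"\(?([A-Za-z])\)?", match.group("answer").strip())
    if choice and choice.group(1).upper() == gold_choice.upper():
        reward += 1.0  # accuracy reward
    return reward
```

Because both checks are plain string rules, a reward of this kind needs no learned reward model, which is one reason such setups are common in R1-style training.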

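✨ Key Features

First breakthrough: the first successful reproduction of the "aha moment" and increased response length for multimodal reasoning on a non-SFT 2B model.
Vision-centric tasks benefit: vision-centric tasks can also benefit from improved reasoning capabilities. During reinforcement learning on vision-based reasoning tasks, the model shows self-reflection, rethinking and correcting its own mistakes. For example: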

. . .
Therefore, dark brown wooden bed with white blanket is not above the doorway.
But wait! I can think of something else.
Maybe it's just higher than above the doorway, but slightly lower than above the doorway.
. . .
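
One simple way to look for this kind of self-reflection in generated text is to search for re-thinking phrases. The sketch below is a heuristic written for illustration; the marker list and the function name are assumptions, not part of the project.

```python
# Hypothetical phrases that often signal a model revisiting its reasoning.
REFLECTION_MARKERS = (
    "but wait",
    "let me re-check",
    "on second thought",
    "i can think of something else",
)


def find_reflection_markers(text, markers=REFLECTION_MARKERS):
    """Return the markers that occur in text, ordered by first appearance."""
    lowered = text.lower()
    hits = [(lowered.find(marker), marker) for marker in markers if marker in lowered]
    return [marker for _, marker in sorted(hits)]


sample = (
    "Therefore, dark brown wooden bed with white blanket is not above the doorway.\n"
    "But wait! I can think of something else.\n"
)
print(find_reflection_markers(sample))
# ['but wait', 'i can think of something else']
```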

📦 Installation

Requirements

Python >= 3.10
PyTorch == 2.0.1
CUDA Version >= 11.7
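
A quick way to compare a local setup against these requirements is a small check script. This is a minimal sketch: the helper names are illustrative, and it only reports what it finds rather than enforcing anything. If PyTorch is not installed, it says so and stops.

```python
import re
import sys


def parse_version(version):
    """Turn a string like "2.0.1+cu117" into a tuple of ints, e.g. (2, 0, 1)."""
    core = version.split("+", 1)[0]  # drop a local build tag such as "+cu117"
    numbers = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group(0)))
    return tuple(numbers)


def check_environment():
    """Print whether Python and (if installed) PyTorch match the listed requirements."""
    print("Python >= 3.10:", sys.version_info >= (3, 10))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("PyTorch == 2.0.1:", parse_version(torch.__version__) == (2, 0, 1))
    cuda = torch.version.cuda  # None for CPU-only builds
    print("CUDA >= 11.7:", cuda is not None and parse_version(cuda) >= (11, 7))


check_environment()
```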

Installation Steps

Install the required packages:

# Install transformers
pip install git+https://github.com/huggingface/transformers
# Install the qwen-vl utilities
pip install qwen-vl-utils

💻 Usage Example

Basic Usage

from PIL import Image
import requests
from io import BytesIO
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the model directly
processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero", torch_dtype="auto", device_map="auto")
model.eval()

# Prepare the image input
image_url = "https://multimodal-r1.s3.us-west-1.amazonaws.com/demo_image.jpg"

# Prepare the text input
question = "Considering the relative positions of the sofa and the picture in the image provided, where is the sofa located with respect to the picture? Select from the following choices.\n(A) above or \n(B) below"
prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"

# Create the message
message = [
    {
        "type": "image",
        "image": image_url,
    },
    {"type": "text", "text": "<image>" + prompt},
]

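# Process the input
response = requests.get(image_url)
response.raise_for_status()  # fail early if the image download did not succeed
image = Image.open(BytesIO(response.content))
text = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
# Named `inputs` to avoid shadowing the built-in `input`;
# the processor expects the keyword `images`, not `image`
inputs = processor(
    text=text,
    images=image,
    padding=True,
    return_tensors="pt",
)
# Move the tensors to the device chosen by device_map="auto"
inputs = inputs.to(model.device)

# Generate the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=1024, do_sample=True)
# Drop the prompt tokens so that only newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]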
batch_output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

# Get the output
output_text = batch_output_text[0]
print(output_text)
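
Because the prompt ends with an open `<think>` tag, the generated text starts with the reasoning itself. The sketch below shows one way to separate the reasoning from the final answer and pull out the chosen option. It assumes the model closes its reasoning with `</think>` and may wrap the answer in `<answer>` tags; that layout, the helper names, and the sample string are assumptions for illustration, not documented behavior.

```python
import re


def split_reasoning(output_text):
    """Split generated text into (reasoning, answer_text).

    Assumes the reasoning ends at the first "</think>" tag. If the tag is
    missing, the whole text is treated as reasoning and the answer is empty.
    """
    reasoning, separator, rest = output_text.partition("</think>")
    if not separator:
        return output_text.strip(), ""
    # Drop optional <answer> tags around the final answer.
    answer = re.sub(r"</?answer>", "", rest).strip()
    return reasoning.strip(), answer


def extract_choice(answer_text):
    """Return the first option letter written as "(A)" or "(B)", else None."""
    match = re.search(r"\(([A-Z])\)", answer_text)
    return match.group(1) if match else None


# Made-up output string, used only to show the helpers in action.
example = "The sofa sits under the frame.</think><answer>(B) below</answer>"
reasoning, answer = split_reasoning(example)
print(answer)                  # (B) below
print(extract_choice(answer))  # B
```

With the usage example above, the same helpers would be called as `split_reasoning(output_text)`.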

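🙌 Stay Connected

We are always open to meaningful discussions, collaborations, or even just sharing a virtual coffee. To get in touch or join our team, visit the TurningPoint AI homepage for contact information.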

📖 Acknowledgements

We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, SAT, and CV-Bench for providing the open-source resources that laid the foundation for our project.

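🤝 Contributors

The main contributors to this project from TurningPoint AI are:

Hengguang Zhou¹*, Xirui Li¹*, Ruochen Wang¹†, Minhao Cheng², Tianyi Zhou³, and Cho-Jui Hsieh¹,⁴

* Project leads, † Main advisor. ¹ University of California, Los Angeles, ² Penn State University, ³ University of Maryland, ⁴ Google Research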

✏️ Citation

@misc{zhou2025r1zerosahamomentvisual,
      title={R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model}, 
      author={Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
      year={2025},
      eprint={2503.05132},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.05132}, 
}