ViGoRL-7b-Spatial open-source vision-language model: links text to image coordinates for grounded visual reasoning

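ViGoRL 7B Spatial

Developed by gsarch

ViGoRL is a vision-language model fine-tuned with reinforcement learning to explicitly link textual reasoning steps to visual coordinates, enabling precise visual reasoning and grounding.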

Image-text-to-text

Transformers

#multi-turn visual grounding #reinforcement learning fine-tuning #region-level visual reasoning

Downloads: 319

Published: 6/19/2025

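Model introduction

ViGoRL is a vision-language model fine-tuned with reinforcement learning (RL) to explicitly anchor textual reasoning steps to visual coordinates. Inspired by human visual cognition, ViGoRL uses multi-turn visual grounding, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding.

Model features

Multi-turn visual grounding

The model grounds its reasoning over several turns, zooming into image regions as needed to inspect fine-grained detail.

Precise visual reasoning

The model performs well on visual reasoning tasks that require precise visual grounding and region-level reasoning.

Multi-stage training

The model is first trained with supervised fine-tuning (SFT) on visually grounded reasoning traces generated by Monte Carlo Tree Search (MCTS), and then with reinforcement learning using Group Relative Policy Optimization (GRPO).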

Model capabilities

Visual reasoning

Visual grounding

Multi-turn interaction

Dynamic zooming into image regions

Use cases

Spatial reasoning: SAT-2, BLINK, RoboSpatial
Visual search: V*Bench
Web interaction and grounding: ScreenSpot (Pro and V2), VisualWebArena

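🚀 ViGoRL: Visually Grounded Reinforcement Learning for Visual Reasoning

ViGoRL (Visually Grounded Reinforcement Learning) is a model for visual reasoning. The project fine-tunes a vision-language model with reinforcement learning (RL) so that each textual reasoning step is explicitly tied to visual coordinates.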
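To make the idea of reasoning steps tied to coordinates concrete, here is a minimal sketch of pulling coordinate pairs out of a reasoning string. It assumes, purely for illustration, that coordinates appear as integer pairs such as `(120, 340)`; the actual output format is defined by the model and its chat template, and `extract_points` and `points_inside` are hypothetical helper names.

```python
import re

# Assumed format (not confirmed by the source): integer pairs like "(120, 340)".
POINT_PATTERN = re.compile(r"\((\d+),\s*(\d+)\)")


def extract_points(reasoning: str) -> list[tuple[int, int]]:
    """Return every (x, y) integer pair found in the text, in order of appearance."""
    return [(int(x), int(y)) for x, y in POINT_PATTERN.findall(reasoning)]


def points_inside(points, width, height):
    """Keep only the points that fall inside a width x height image."""
    return [(x, y) for x, y in points if 0 <= x < width and 0 <= y < height]


sample = "The mug sits at (120, 340), left of the laptop at (410, 300)."
print(extract_points(sample))  # [(120, 340), (410, 300)]
```

Filtering with `points_inside` is a cheap sanity check before using a predicted point to crop or click.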

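Installation

No installation steps are provided yet; refer to the instructions in the code repository if needed. The usage example below imports `transformers`, `torch`, and `qwen_vl_utils`, and enables FlashAttention 2.

Usage examples

Basic usage

You can load this model with the Hugging Face Transformers library: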

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Hugging Face repo ID of the ViGoRL checkpoint to load.
# The ID is blank in the source; fill it in with any of the ViGoRL models.
MODEL_ID = ""

# Default: load the model on the available device(s).
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     MODEL_ID, torch_dtype="auto", device_map="auto"
# )

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained(MODEL_ID)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(MODEL_ID, min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.png",
            },
            {"type": "text", "text": "QUERY HERE"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto" instead of hard-coding "cuda"

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)  # a list with one decoded string; for multiturn variants this is a single tool-call turn
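The commented-out `min_pixels` and `max_pixels` lines in the example express each bound as a token count multiplied by 28*28, that is, one visual token per 28x28-pixel patch. The arithmetic can be sketched as follows; `pixel_budget` and `approx_visual_tokens` are hypothetical helper names, and the token estimate ignores any resizing or rounding the processor applies.

```python
PATCH_SIDE = 28  # pixels per side of the patch behind one visual token, per the example's comments


def pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token count into the matching pixel count."""
    return num_tokens * PATCH_SIDE * PATCH_SIDE


def approx_visual_tokens(width: int, height: int) -> int:
    """Rough token count for an image, ignoring processor resizing and rounding."""
    return (width * height) // (PATCH_SIDE * PATCH_SIDE)


min_pixels = pixel_budget(256)   # 200704
max_pixels = pixel_budget(1280)  # 1003520
print(min_pixels, max_pixels)
```

Lowering `max_pixels` reduces the number of visual tokens per image, and with it memory and latency, at the cost of fine detail.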

Important: this model requires a system prompt to work properly. See the model's chat template for details.
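One way to supply a system prompt is to prepend a `system` message to the conversation before calling `apply_chat_template`. The sketch below only builds the message list; `build_messages` is a hypothetical helper, and the prompt text is a placeholder that should be replaced with the wording from the model's chat template.

```python
def build_messages(system_prompt: str, image_path: str, query: str) -> list[dict]:
    """Build a chat message list with a system turn followed by one image+text user turn."""
    if not system_prompt.strip():
        raise ValueError("system_prompt must not be empty")
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": query},
            ],
        },
    ]


# Placeholder prompt: take the real text from the model's chat template.
messages = build_messages(
    "<system prompt from the model's chat template>",
    "path/to/image.png",
    "QUERY HERE",
)
print([m["role"] for m in messages])  # ['system', 'user']
```

The resulting list can be passed to `processor.apply_chat_template` in place of the `messages` value used in the basic example.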

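Detailed documentation

Model overview

Training proceeds in two stages: supervised fine-tuning (SFT) on visually grounded reasoning traces generated by Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO).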
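For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization only. It is not the authors' training code: the reward values are made up, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.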

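Model details

Property	Details
Base architecture	Qwen2.5-VL (3B or 7B parameters)
Training paradigm	1. Supervised fine-tuning on MCTS-generated reasoning traces; 2. Group Relative Policy Optimization (GRPO); 3. Multi-turn visual grounding with dynamic zoom-in feedback (if the model name contains "Multiturn")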
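The dynamic zoom-in step of the multi-turn variants can be pictured as cropping a window around a predicted point and feeding the crop back to the model. The sketch below computes such a window with plain arithmetic. It is an illustration under assumptions, not the model's actual tool: `zoom_box` and the `crop_frac` parameter are hypothetical.

```python
def zoom_box(point, image_size, crop_frac=0.5):
    """Return (left, top, right, bottom) of a crop centered on `point`, clamped to the image."""
    if not 0 < crop_frac <= 1:
        raise ValueError("crop_frac must be in (0, 1]")
    x, y = point
    width, height = image_size
    crop_w = max(1, int(width * crop_frac))
    crop_h = max(1, int(height * crop_frac))
    # Shift the window back inside the image when the point is near an edge.
    left = min(max(x - crop_w // 2, 0), width - crop_w)
    top = min(max(y - crop_h // 2, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


print(zoom_box((600, 50), (640, 480)))  # (320, 0, 640, 240)
```

The returned 4-tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the crop could be taken directly from a loaded image.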

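Use cases

The model performs well on visual reasoning tasks that require precise visual grounding and region-level reasoning. The model name indicates the domain each checkpoint targets.

Spatial reasoning: SAT-2, BLINK, RoboSpatial
Visual search: V*Bench
Web interaction and grounding: ScreenSpot (Pro and V2), VisualWebArena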

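Technical details

The model was introduced in the paper "Grounded Reinforcement Learning for Visual Reasoning" by Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, and Katerina Fragkiadaki.

Datasets and training data

The training datasets and the generated reasoning chains are publicly available.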

Citation

If you use ViGoRL in your research or applications, please cite our paper:

@article{sarch2025vigorl,
    title={Grounded Reinforcement Learning for Visual Reasoning},
    author={Sarch, Gabriel and Saha, Snigdha and Khandelwal, Naitik and Jain, Ayush and Tarr, Michael J and Kumar, Aviral and Fragkiadaki, Katerina},
    year={2025}
}