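InfiGUI-R1-3B

Developed by Reallm-Labs

A multimodal GUI agent based on Qwen2.5-VL-3B-Instruct that uses reinforcement learning to strengthen planning and reflection in graphical user interface tasks.

Image-to-Text

Transformers

Supports multiple languages

License: Apache-2.0 #GUI agent #multimodal reasoning #reinforcement learning optimization

Downloads: 105

Released: 4/19/2025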

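Model overview

The model focuses on graphical user interface (GUI) tasks: it understands interface elements, performs interactions with them, and can plan and reflect on its actions.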

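Model features

GUI interaction

Understands and operates graphical user interface elements, performing interactions such as clicking and swiping.

Planning and reflection

Uses the Actor2Reasoner framework to strengthen task planning and reflection on execution.

Multimodal understanding

Processes image and text inputs together to understand interface elements and their functions.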

Model capabilities

GUI element grounding

Interface action trajectory planning

Multimodal reasoning

Reflection on task execution

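Use cases

Mobile app testing

Automated UI testing

Automatically runs mobile app UI test flows.

Can identify interface elements and carry out a predefined sequence of actions.

Accessibility

Assistance for visually impaired users

Helps visually impaired users understand and operate interfaces.

Can describe interface elements and guide the user through actions.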

Base model: Qwen/Qwen2.5-VL-3B-Instruct
Language: English
License: apache-2.0
Tags: GUI, agent
Pipeline tag: image-text-to-text
Library name: transformers

InfiGUI-R1-3B

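This repository contains the model from the InfiGUI-R1 paper. The model is based on Qwen2.5-VL-3B-Instruct and trained with the proposed Actor2Reasoner framework, which uses reinforcement learning to strengthen its planning and reflection abilities in graphical user interface tasks.

Quick start

Installation

First install the required dependencies: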

pip install transformers qwen-vl-utils
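
The example script below also imports torch, cv2, requests, and PIL, which the install command above does not list explicitly. As a minimal sketch, the following check reports which of those modules cannot be found in the current environment. The mapping from module names to pip package names (for example cv2 to opencv-python) is an assumption about common packaging, not something stated in the model card.

```python
import importlib.util

# Modules imported by the example script, mapped to assumed pip package names.
REQUIRED = {
    "transformers": "transformers",
    "qwen_vl_utils": "qwen-vl-utils",
    "torch": "torch",
    "cv2": "opencv-python",
    "requests": "requests",
    "PIL": "Pillow",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All example dependencies are importable.")
```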

Example: GUI grounding and trajectory tasks

import cv2
import json
import torch
import requests
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info, smart_resize

MAX_IMAGE_PIXELS = 5600*28*28

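# Load the model and processor.
# Note: attn_implementation="flash_attention_2" requires the flash-attn package;
# remove that argument to fall back to the default attention implementation.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Reallm-Labs/InfiGUI-R1-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUI-R1-3B", max_pixels=MAX_IMAGE_PIXELS, padding_side="left")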

# Prepare the image: download it, then compute the resolution the model will see
img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUI-R1/main/images/test_img.png"
response = requests.get(img_url)
response.raise_for_status()  # fail early if the download did not succeed
with open("test_img.png", "wb") as f:
    f.write(response.content)
image = Image.open("test_img.png")
width, height = image.size
new_height, new_width = smart_resize(height, width, max_pixels=MAX_IMAGE_PIXELS)

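# Prepare the inputs
instruction = "View detailed storage space usage"

system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer.\nThe reasoning process MUST BE enclosed within <think></think> tags.'
# The following prompt is mainly taken from https://github.com/QwenLM/Qwen2.5-VL
tool_prompt = "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"mobile_use\", \"description\": \"Use a touchscreen to interact with a mobile device, and take screenshots.\\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\\n* The screen's resolution is " + str(new_width) + "x" + str(new_height) + ".\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Perform a key event on the mobile device.\\n    - This supports adb's `keyevent` syntax.\\n    - Examples: \\\"volume_up\\\", \\\"volume_down\\\", \\\"power\\\", \\\"camera\\\", \\\"clear\\\".\\n* `click`: Click the point on the screen with coordinate (x, y).\\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinate (x2, y2).\\n* `type`: Input the specified text into the activated input box.\\n* `system_button`: Press the system button.\\n* `open`: Open an app on the device.\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\", \"enum\": [\"key\", \"click\", \"long_press\", \"swipe\", \"type\", \"system_button\", \"open\", \"wait\", \"terminate\"], \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates. Required only by `action=click`, `action=long_press`, and `action=swipe`.\", \"type\": \"array\"}, \"coordinate2\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates. Required only by `action=swipe`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=key`, `action=type`, and `action=open`.\", \"type\": \"string\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=long_press` and `action=wait`.\", \"type\": \"number\"}, \"button\": {\"description\": \"Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter key. Required only by `action=system_button`.\", \"enum\": [\"Back\", \"Home\", \"Menu\", \"Enter\"], \"type\": \"string\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>"
grounding_prompt = f'The screen resolution is {new_width}x{new_height}.\nPoint to the UI element most relevant to "{instruction}", output its coordinates using JSON format:\n```json\n[\n    {{"point_2d": [x, y], "label": "object name/description"}}\n]```'
trajectory_prompt = f'The user query: {instruction}\nTask progress (You have done the following operation on the current device): '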

# Build the messages (one conversation per task, processed as a batch of two)
grounding_messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "test_img.png"},
            {"type": "text", "text": grounding_prompt}
        ]
    }
]
trajectory_messages = [
    {"role": "system", "content": system_prompt + "\n\n" + tool_prompt},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": trajectory_prompt},
            {"type": "image", "image": "test_img.png"}
        ],
    },
]
messages = [grounding_messages, trajectory_messages]

# Process the inputs and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
# Move inputs to the model's device instead of hard-coding "cuda"
inputs = processor(text=text, images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated tokens are decoded
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

# Visualize the results
# Keep only the final answer that follows the reasoning block
output_text = [ot.split("</think>")[-1] for ot in output_text]

grounding_output = output_text[0].replace("```json", "").replace("```", "").strip()
trajectory_output = output_text[1].replace("<tool_call>", "").replace("</tool_call>", "").strip()

try:
    grounding_output = json.loads(grounding_output)
    trajectory_output = json.loads(trajectory_output)

    grounding_coords = grounding_output[0]['point_2d']
    # Not every action carries a coordinate (for example `type` or `wait`)
    trajectory_coords = trajectory_output["arguments"].get("coordinate")

    grounding_label = grounding_output[0]['label']
    trajectory_label = json.dumps(trajectory_output["arguments"])

    # Load the original image
    img = cv2.imread("test_img.png")
    if img is None:
        raise ValueError("Could not load the image")

    height, width = img.shape[:2]

    # Create a copy for each visualization
    grounding_img = img.copy()
    trajectory_img = img.copy()

    # Visualize the grounding coordinates
    if grounding_coords:
        # Rescale from the resized resolution back to the original image
        x = int(grounding_coords[0] / new_width * width)
        y = int(grounding_coords[1] / new_height * height)

        cv2.circle(grounding_img, (x, y), 10, (0, 0, 255), -1)
        cv2.putText(grounding_img, grounding_label, (x+10, y-10),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
        cv2.imwrite("grounding_output.png", grounding_img)
        print("Predicted coordinates:", grounding_coords)
        print("Grounding visualization saved to grounding_output.png")

    # Visualize the trajectory coordinates
    if trajectory_coords:
        x = int(trajectory_coords[0] / new_width * width)
        y = int(trajectory_coords[1] / new_height * height)

        cv2.circle(trajectory_img, (x, y), 10, (0, 0, 255), -1)
        cv2.putText(trajectory_img, trajectory_label, (x+10, y-10),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
        cv2.imwrite("trajectory_output.png", trajectory_img)
        print("Predicted action:", trajectory_label)
        print("Trajectory visualization saved to trajectory_output.png")

except Exception as e:
    print(f"Error: could not parse the coordinates or process the image: {e}")
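
The post-processing at the end of the script (dropping the reasoning block, parsing the JSON answer, and rescaling coordinates to the original image) can be isolated into small functions that run without the model. The sketch below is one possible arrangement, not the authors' implementation: the helper names are hypothetical, and the two sample strings are made up for illustration rather than real model output.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that the grounding prompt asks for


def strip_reasoning(text):
    """Keep only the text after the last </think> tag."""
    return text.split("</think>")[-1].strip()


def parse_grounding(text):
    """Parse a grounding answer: a JSON list, optionally wrapped in a json fence."""
    body = strip_reasoning(text).replace(FENCE + "json", "").replace(FENCE, "")
    return json.loads(body)


def parse_tool_call(text):
    """Return the JSON object inside the first <tool_call> block, or None."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", strip_reasoning(text), re.DOTALL)
    return json.loads(match.group(1)) if match else None


def to_original_pixels(point, resized_size, original_size):
    """Map (x, y) from the resized (width, height) to the original (width, height)."""
    (x, y), (rw, rh), (ow, oh) = point, resized_size, original_size
    return int(x / rw * ow), int(y / rh * oh)


# Made-up sample outputs, for illustration only (not real model output).
sample_grounding = (
    "<think>Storage is listed mid-screen.</think>\n"
    + FENCE + 'json\n[{"point_2d": [250, 500], "label": "Storage"}]' + FENCE
)
sample_trajectory = (
    "<think>Tap the storage row.</think>\n<tool_call>\n"
    '{"name": "mobile_use", "arguments": {"action": "click", "coordinate": [250, 500]}}'
    "\n</tool_call>"
)

point = parse_grounding(sample_grounding)[0]["point_2d"]
call = parse_tool_call(sample_trajectory)
print(to_original_pixels(point, (500, 1000), (1000, 2000)))
print(call["arguments"]["action"])
```

Returning None when no tool call is present lets the caller distinguish a missing action from malformed JSON, which still raises json.JSONDecodeError.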

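For more information, please refer to our repository.

Citation

If you find this work useful, we would appreciate it if you cite the following papers: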

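@article{liu2025infigui,
  title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
  author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2504.14239},
  year={2025}
}

@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}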