Open-source GUI-Actor-2B-Qwen2-VL model for accurate graphical user interface (GUI) grounding

GUI Actor 2B Qwen2 VL

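Developed by microsoft

GUI-Actor-2B is a vision-language model built on Qwen2-VL-2B and designed for graphical user interface (GUI) grounding. It adds an attention-based action head to the backbone and is fine-tuned for the task, and it performs well on several GUI grounding benchmarks.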

Image-Text-to-Text

Transformers

License: MIT #GUI grounding #vision-language model #attention-based action head

Downloads: 163

Released: 6/1/2025

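Model Overview

The model performs GUI grounding: given a screenshot and an instruction, it predicts where on the screen the action should take place.

Model Features

Qwen2-VL backbone

Built on the Qwen2-VL-2B vision-language model, which provides the visual understanding

Dedicated action head

Adds an attention-based action head trained specifically for GUI grounding

Strong results across benchmarks

Posts the best scores among the listed 2B models on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2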

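Model Capabilities

GUI element grounding

Vision-language understanding

On-screen instruction understanding

Action point prediction

Use Cases

Automated testing

GUI element grounding

Locates a specific on-screen element from an instruction

Reaches 36.7% accuracy on ScreenSpot-Pro

Assistive tools

Accessibility assistance

Helps visually impaired users operate graphical interfaces through voice instructions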

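🚀 GUI-Actor-2B: a GUI grounding model with a Qwen2-VL-2B backbone

GUI-Actor-2B uses Qwen2-VL-2B as its backbone vision-language model. An attention-based action head is added on top of the backbone, and the model is fine-tuned to perform graphical user interface (GUI) grounding. It performs well on several GUI grounding benchmarks.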

🚀 Quick Start

Model Information

Property	Details
Base model	Qwen/Qwen2-VL-2B-Instruct
License	MIT
Library	transformers
Task type	Image-Text-to-Text

Model Versions

Model name	Hugging Face link
GUI-Actor-7B-Qwen2-VL	Hugging Face
GUI-Actor-2B-Qwen2-VL	Hugging Face
GUI-Actor-7B-Qwen2.5-VL	Hugging Face
GUI-Actor-3B-Qwen2.5-VL	Hugging Face
GUI-Actor-Verifier-2B	Hugging Face

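✨ Key Features

GUI-Actor-2B uses Qwen2-VL-2B as its backbone vision-language model. With an added attention-based action head and task-specific fine-tuning, it achieves good results on graphical user interface (GUI) grounding.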
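
To make the idea of an attention-based action head more concrete, here is a minimal sketch in plain Python. It is an illustration only, not the GUI-Actor implementation: the function name `attention_pointer`, the query vector, and the toy patch features are all assumptions. The sketch scores each image patch against a single action query with scaled dot-product attention and returns the centre of the highest-scoring patches as normalized (x, y) points.

```python
import math


def attention_pointer(action_query, patch_features, grid_w, topk=1):
    """Hypothetical helper: attend from one action query over image patches.

    Returns the centres of the top-k patches as normalized (x, y) points,
    together with the full attention distribution.
    """
    dim = len(action_query)
    # Scaled dot-product score between the query and every patch feature.
    scores = [
        sum(q * p for q, p in zip(action_query, patch)) / math.sqrt(dim)
        for patch in patch_features
    ]
    # Numerically stable softmax over the patches.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    grid_h = len(patch_features) // grid_w
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:topk]
    points = []
    for idx in ranked:
        row, col = divmod(idx, grid_w)
        # Centre of the patch, normalized to the [0, 1] range.
        points.append(((col + 0.5) / grid_w, (row + 0.5) / grid_h))
    return points, attn


# Toy 2x2 patch grid with 2-dimensional features (made-up values).
patches = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.1], [3.0, 0.0]]
points, attn = attention_pointer([1.0, 0.0], patches, grid_w=2)
print(points)
```

In this toy grid the bottom-right patch aligns best with the query, so the returned point is the centre of that patch. Because the prediction comes from attention over patches, no coordinate text has to be generated.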

📊 Performance Comparison

Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 for models with a Qwen2-VL backbone

Table 1 reports results on each dataset for models that use Qwen2-VL as the backbone. † marks scores we obtained by evaluating the official models from Hugging Face.

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot	ScreenSpot-v2
*72B models:*
AGUVIS-72B	Qwen2-VL	-	89.2	-
UGround-V1-72B	Qwen2-VL	34.5	89.4	-
UI-TARS-72B	Qwen2-VL	38.1	88.4	90.3
*7B models:*
OS-Atlas-7B	Qwen2-VL	18.9	82.5	84.1
AGUVIS-7B	Qwen2-VL	22.9	84.4	86.0†
UGround-V1-7B	Qwen2-VL	31.1	86.3	87.6†
UI-TARS-7B	Qwen2-VL	35.7	89.5	91.6
GUI-Actor-7B	Qwen2-VL	40.7	88.3	89.5
GUI-Actor-7B + Verifier	Qwen2-VL	44.2	89.7	90.9
*2B models:*
UGround-V1-2B	Qwen2-VL	26.6	77.1	-
UI-TARS-2B	Qwen2-VL	27.7	82.3	84.7
GUI-Actor-2B	Qwen2-VL	36.7	86.5	88.6
GUI-Actor-2B + Verifier	Qwen2-VL	41.8	86.9	89.3
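
The scores in this table are accuracy percentages. As a rough sketch of how such a number can be computed, the snippet below assumes the common click-accuracy convention for GUI grounding: a prediction counts as correct when the predicted point falls inside the ground-truth box. The helper names `point_in_box` and `grounding_accuracy` are illustrative, and the second sample is made up. The first sample reuses the point and box from the usage example on this page.

```python
def point_in_box(point, box):
    """Return True when (x, y) lies inside the box (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(predictions, boxes):
    """Percentage of predicted points that land inside their target box."""
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return 100.0 * hits / len(predictions)


predictions = [
    (0.9709, 0.1548),  # point from the usage example on this page
    (0.5000, 0.5000),  # made-up miss for illustration
]
boxes = [
    (0.9479, 0.1444, 0.9938, 0.2074),  # box from the usage example
    (0.1000, 0.1000, 0.2000, 0.2000),  # made-up box
]
accuracy = grounding_accuracy(predictions, boxes)
print(f"accuracy: {accuracy:.1f}%")
```

With one hit and one miss, the sketch reports 50.0%.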

Main results on ScreenSpot-Pro and ScreenSpot-v2 for models with a Qwen2.5-VL backbone

Table 2 reports results on each dataset for models that use Qwen2.5-VL as the backbone.

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot-v2
*7B models:*
Qwen2.5-VL-7B	Qwen2.5-VL	27.6	88.8
Jedi-7B	Qwen2.5-VL	39.5	91.7
GUI-Actor-7B	Qwen2.5-VL	44.6	92.1
GUI-Actor-7B + Verifier	Qwen2.5-VL	47.7	92.5
*3B models:*
Qwen2.5-VL-3B	Qwen2.5-VL	25.9	80.9
Jedi-3B	Qwen2.5-VL	36.1	88.6
GUI-Actor-3B	Qwen2.5-VL	42.2	91.0
GUI-Actor-3B + Verifier	Qwen2.5-VL	45.9	92.4
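
This page does not describe how the verifier is applied. One plausible reading is that the verifier re-scores several candidate points (the usage example requests `topk=3`) and the best-scoring candidate is kept. The sketch below shows only that re-ranking step. `rerank_with_verifier` and `toy_score` are hypothetical names, and the toy scoring function stands in for a real verifier model.

```python
def rerank_with_verifier(candidates, verifier_score):
    """Sort candidate points by descending verifier score."""
    return sorted(candidates, key=verifier_score, reverse=True)


def toy_score(point):
    """Stand-in for a verifier: prefers points near an assumed target."""
    target = (0.97, 0.17)  # made-up target, for illustration only
    return -((point[0] - target[0]) ** 2 + (point[1] - target[1]) ** 2)


# Three made-up candidates, as a top-3 prediction might return.
candidates = [(0.50, 0.50), (0.9709, 0.1548), (0.10, 0.90)]
best = rerank_with_verifier(candidates, toy_score)[0]
print(best)
```

A real verifier would score each candidate against the screenshot and the instruction rather than a fixed target.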

💻 Usage Example

Basic usage

import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# Load the model
model_name_or_path = "microsoft/GUI-Actor-2B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# Prepare an example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 4) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"], # PIL.Image.Image or a string path to the image
                # "image_url": "https://xxxxx.png", "https://xxxxx.jpg", "file://xxxxx.png", or "data:image/png;base64,xxxxxxxx" (split on "base64,")
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# Run inference
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Model response
# Instruction: close this window
# Ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
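
The predicted point and the ground-truth box in the output above appear to be normalized to the [0, 1] range. To act on a real screen, the point has to be scaled to pixel coordinates. The following is a minimal sketch under that assumption. The helper name `to_pixels` and the 1920x1080 screen size are illustrative and not part of the GUI-Actor code.

```python
def to_pixels(point, width, height):
    """Scale a normalized (x, y) point to integer pixel coordinates."""
    x, y = point
    # Clamp so that a value of exactly 1.0 still maps to a valid pixel.
    px = min(max(round(x * width), 0), width - 1)
    py = min(max(round(y * height), 0), height - 1)
    return px, py


# Predicted point from the example output, on an assumed 1920x1080 screen.
click_x, click_y = to_pixels((0.9709, 0.1548), 1920, 1080)
print(click_x, click_y)
```

The resulting pair could then be handed to an automation tool, for example `pyautogui.click(click_x, click_y)`, once the screenshot size matches the real screen.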

📚 Documentation

See the project page and the GitHub repository for more details.

📄 License

This project is released under the MIT License.

📚 Citation

If you use this model in your research, please cite the following paper:

@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents}, 
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}