---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: visual-question-answering
---
## Introduction

This repository contains UI-R1-E-3B, an efficient GUI grounding model introduced in the paper *UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning*.

Project page: https://github.com/lll6gg/UI-R1

Previous version: UI-R1-3B
## Benchmark 1: ScreenSpotV2

| ScreenSpotV2 | Reasoning mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
|---|---|---|---|---|---|---|---|---|
| OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / - |
| UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / - |
| UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
| GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
| UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
| UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 / 28 |
## Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | Reasoning mode | Avg length↓ | Avg accuracy↑ |
|---|---|---|---|
| UGround-7B | w/o thinking | - | 16.5 |
| OS-ATLAS-7B | w/o thinking | - | 18.9 |
| UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
| GUI-R1-3B | w/ thinking | 114 | 26.6 |
| UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
| UI-R1-E-3B | w/o thinking | 28 | 33.5 |
## Leaderboard: UI-I2E-Bench

| Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Avg |
|---|---|---|---|---|
| UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
| Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
| UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
| UI-R1-E-3B | 89.2 | 69.1 | 33.5 | 63.9 |
| Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
| InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
| UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
| Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
| UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
| UI-TARS-2B | 82.3 | 62 | 27.7 | 57.3 |
| Qwen2.5-VL-7B | 84.7 | 53.8 | 29 | 55.8 |
| OmniParser-V2 | 72 | 54.8 | 39.6 | 55.5 |
| Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
| OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
| UI-R1-3B | 83.3 | 58.5 | 17.8 | 53.2 |
| UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
| UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
| OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
| ShowUI-2B | 76.8 | 41.5 | 7.7 | 42 |
| Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
| Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
| OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
| Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31 |
| Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
| InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |
## Evaluation Code for GUI Grounding

- Generation by UI-R1-E-3B:
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# args.model_path, rank, ori_processor_path, task_prompt, image_path and
# extract_coord are assumed to be provided by the surrounding evaluation script.

# Load the model in bfloat16 with FlashAttention-2, then move it to the target device.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cpu",
)
model = model.to(torch.device(rank))
model = model.eval()
processor = AutoProcessor.from_pretrained(ori_processor_path)

question_template = (
    f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
    "Please provide the action to perform (enumerate in ['click'])"
    " and the coordinate (integer) to which the cursor moves if a click is performed.\n"
    "Output the final answer in <answer></answer> tags directly."
    "The output answer format should be as follows:\n"
    "<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>\n"
    "Please strictly follow the format."
)

query = '<image>\n' + question_template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path}
        ] + [{"type": "text", "text": query}],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # keep the inputs on the same device as the model
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens and keep only the newly generated ones.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
response = response[0]
pred_coord, _ = extract_coord(response)
```
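`extract_coord` is not defined in this card; a minimal sketch, assuming the model answers in the `<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>` format requested above and that the function returns the coordinate pair plus a success flag, might look like this:

```python
import re

def extract_coord(response: str):
    """Hypothetical parser: pull the first [x, y] pair out of the model's answer.

    Returns ([x, y], True) on success and ([0, 0], False) if no coordinate is found.
    """
    # Prefer the content inside <answer>...</answer>; fall back to the raw response.
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    text = answer.group(1) if answer else response
    # Match the first "[x, y]" pair of numbers, e.g. 'coordinate': [512, 384].
    pair = re.search(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]", text)
    if pair is None:
        return [0, 0], False
    return [float(pair.group(1)), float(pair.group(2))], True
```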
- Rescale the predicted coordinate according to the image resize ratio:

```python
from PIL import Image

# Map the predicted coordinate from the model's resized input resolution
# back to the original image resolution.
image = Image.open(image_path)
origin_width, origin_height = image.size
resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
scale_x = origin_width / resized_width
scale_y = origin_height / resized_height
pred_coord[0] = int(pred_coord[0] * scale_x)
pred_coord[1] = int(pred_coord[1] * scale_y)
```
The `smart_resize` function comes from Qwen2-VL:

```python
import math


def smart_resize(
    height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
    """Rescale the image so that the following conditions are met:

    1. Both height and width are divisible by 'factor'.
    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
    3. The aspect ratio of the image is preserved as closely as possible.
    """
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```
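As a quick sanity check of the rescaling step, the snippet below walks through the same math with made-up numbers; the screenshot size and predicted point are arbitrary examples, not values from the paper:

```python
# Hypothetical 2560x1440 screenshot, using the same max_pixels as above.
origin_width, origin_height = 2560, 1440
resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
print(resized_height, resized_width)  # both dimensions are multiples of 28

# A prediction made in resized-image coordinates maps back to the original image:
pred_x, pred_y = 700, 400
scale_x = origin_width / resized_width
scale_y = origin_height / resized_height
print(int(pred_x * scale_x), int(pred_y * scale_y))
```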