license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- gui
library_name: transformers
UI-TARS-7B-DPO
UI-TARS-2B-SFT | UI-TARS-7B-SFT | UI-TARS-7B-DPO (Recommended) | UI-TARS-72B-SFT | UI-TARS-72B-DPO (Recommended)
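Introduction
UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components (perception, reasoning, grounding, and memory) within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
This repository contains the model for the paper UI-TARS: Pioneering Automated GUI Interaction with Native Agents.
Code: https://github.com/bytedance/UI-TARS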
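To make the "single model, end-to-end" idea concrete, here is a minimal sketch of a native agent loop in which one model call per step covers perception (the screenshot), reasoning (the thought), grounding (the action with coordinates), and memory (the step history). Everything in it is hypothetical and for illustration only: `NativeAgent`, `Step`, `stub_model`, and the action strings are not part of the UI-TARS codebase, and the stub stands in for a real VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One agent turn: the model's reasoning and the action it chose."""
    thought: str
    action: str  # hypothetical action string, e.g. "click(120, 48)"


@dataclass
class NativeAgent:
    """Hypothetical wrapper around a single model callable.

    The model receives the task, the current screenshot, and the step
    history, and returns the next Step. No separate planner, detector,
    or rule-based workflow is involved.
    """
    model: Callable[[str, bytes, List[Step]], Step]
    history: List[Step] = field(default_factory=list)

    def run(
        self,
        task: str,
        get_screenshot: Callable[[], bytes],
        execute: Callable[[str], None],
        max_steps: int = 10,
    ) -> List[Step]:
        for _ in range(max_steps):
            screenshot = get_screenshot()                        # perception input
            step = self.model(task, screenshot, self.history)    # reasoning + grounding
            self.history.append(step)                            # memory
            if step.action.startswith("finished"):
                break
            execute(step.action)                                 # act on the GUI
        return self.history


def stub_model(task: str, screenshot: bytes, history: List[Step]) -> Step:
    """Scripted stand-in for a VLM (assumption: a real model goes here)."""
    if not history:
        return Step("The search box is at the top of the page.", "click(120, 48)")
    return Step("The task is complete.", "finished()")


executed: List[str] = []
agent = NativeAgent(model=stub_model)
trace = agent.run("open search", get_screenshot=lambda: b"", execute=executed.append)
print([s.action for s in trace])
```

In a modular framework, the body of `run` would instead chain several separate components; here the only moving part is the model itself, which is the design choice the paragraph above describes.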
Performance
Perception Capability Evaluation
| Model | VisualWebBench | WebSRC | SQAshort |
|---|---|---|---|
| Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
| Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
| Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
| Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
| GPT-4o | 78.5 | 87.7 | 82.3 |
| UI-TARS-2B | 72.9 | 89.2 | 86.4 |
| UI-TARS-7B | 79.7 | 93.6 | 87.7 |
| UI-TARS-72B | 82.8 | 89.3 | 88.6 |
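As a quick way to read the perception results, the following sketch loads the scores from the table above into a dictionary and picks the top model per benchmark. The helper name `best_per_benchmark` is made up for this example; the numbers are copied from the table.

```python
# Perception scores copied from the table: (VisualWebBench, WebSRC, SQAshort).
BENCHMARKS = ("VisualWebBench", "WebSRC", "SQAshort")
SCORES = {
    "Qwen2-VL-7B": (73.3, 81.8, 84.9),
    "Qwen-VL-Max": (74.1, 91.1, 78.6),
    "Gemini-1.5-Pro": (75.4, 88.9, 82.2),
    "UIX-Qwen2-7B": (75.9, 82.9, 78.8),
    "Claude-3.5-Sonnet": (78.2, 90.4, 83.1),
    "GPT-4o": (78.5, 87.7, 82.3),
    "UI-TARS-2B": (72.9, 89.2, 86.4),
    "UI-TARS-7B": (79.7, 93.6, 87.7),
    "UI-TARS-72B": (82.8, 89.3, 88.6),
}


def best_per_benchmark(scores, benchmarks):
    """Return {benchmark: (model, score)} for the highest score in each column."""
    best = {}
    for i, name in enumerate(benchmarks):
        model = max(scores, key=lambda m: scores[m][i])
        best[name] = (model, scores[model][i])
    return best


best = best_per_benchmark(SCORES, BENCHMARKS)
for name, (model, score) in best.items():
    print(f"{name}: {model} ({score})")
```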
Grounding Capability Evaluation
| Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.1 |
| GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | 0.8 |
| SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | 1.1 |
| Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | 1.6 |
| OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | 3.7 |
| ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | 7.7 |
| CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | 7.7 |
| Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | 11.3 |
| UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | 16.5 |
| Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | 17.1 |
| OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | 18.9 |
| UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | 31.1 |
| UI-TARS-2B | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | 27.7 |
| UI-TARS-7B | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 20.8 | 9.4 | 18.0 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 | 47.8 | 16.2 | 35.7 |
| UI-TARS-72B | 63.0 | 17.3 | 40.8 | 57.1 | 15.4 | 39.6 | 18.8 | 12.5 | 17.2 | 64.6 | 20.9 | 45.7 | 63.3 | 26.4 | 54.8 | 42.1 | 15.7 | 30.1 | 50.9 | 17.5 | 38.1 |
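This card does not spell out how the grounding scores are computed. GUI grounding benchmarks commonly count a prediction as correct when the predicted click point falls inside the target element's bounding box, and report the percentage of correct predictions. The sketch below illustrates that criterion under this assumption; the function names and the sample coordinates are invented for the example and do not come from the UI-TARS evaluation code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their target box.

    Assumed metric (point-in-box); not taken from the source document.
    """
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(preds, boxes))
    return 100.0 * hits / len(preds)


# Made-up sample: the second prediction misses its target box.
sample_preds = [(10, 10), (50, 50), (200, 5)]
sample_boxes = [(0, 0, 20, 20), (60, 40, 80, 60), (190, 0, 210, 10)]
accuracy = grounding_accuracy(sample_preds, sample_boxes)
print(f"{accuracy:.1f}")
```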
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|---|---|---|---|---|---|---|---|
| Agent Framework | | | | | | | |
| GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | 48.8 |
| GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | 75.6 |
| GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.3 |
| GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| Agent Model | | | | | | | |
| GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | 55.3 |
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.8 |
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 82.5 |
| Claude Computer Use | - | - | - | - | - | - | 83.0 |
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | 84.0 |
| Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 84.4 |
| Aguvis-72B | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 89.2 |
| Our Model | | | | | | | |
| UI-TARS-2B | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
| UI-TARS-7B | 94.5 | 85.2 | 95.9 | 85.7 | 90.0 | 83.5 | 89.5 |
| UI-TARS-72B | 94.9 | 82.5 | 89.7 | 88.6 | 88.7 | 85.0 | 88.4 |
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|---|---|---|---|---|---|---|---|
| Agent Framework | | | | | | | |
| GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | | | | |