license: llama2
pipeline_tag: image-text-to-text
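UGround (Initial LLaVA-based Version)
Update: We have trained stronger models on the same data, based on Qwen2-VL. We recommend the new models for better performance and for more convenient training, inference, and deployment.
UGround is a strong GUI visual grounding model trained with a simple recipe. See the project homepage and the paper for details. This work is a collaboration between the OSU NLP Group and Orby AI.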

- Homepage: https://osu-nlp-group.github.io/UGround/
- Repository: https://github.com/OSU-NLP-Group/UGround
- Paper: https://arxiv.org/abs/2410.05243
- Demo: https://huggingface.co/spaces/orby-osu/UGround
- Point of Contact: Boyu Gou
Models
Release Plan
- [x] Model weights
  - [x] Initial version (the one used in the paper)
  - [x] Qwen2-VL-based V1 series (2B/7B/72B)
- [x] Code
- [x] Training data (V1)
- [x] Online demo (HF Spaces)
Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)

| Grounding Model | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| UGround-V1-2B | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| UGround-V1-7B | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
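
The final column of the standard-setting results appears to be the unweighted mean of the six per-split scores (mobile, desktop, and web, each split into text and icon). The sketch below checks that reading against three rows copied from the results above. The helper name `column_average` and the row labels are illustrative, and the unweighted-mean interpretation is an assumption inferred from the numbers, not something the model card states.

```python
# Scores copied from the standard-setting results, in column order:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
scores = {
    "SeeClick": [78.0, 52.0, 72.2, 30.0, 55.7, 32.5],
    "UGround-V1 (LLaVA)": [82.8, 60.3, 82.5, 63.6, 80.4, 70.4],
    "UGround-V1-7B": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
}


def column_average(values):
    """Unweighted mean of the per-split scores, rounded to one decimal.

    Assumption: every split counts equally, regardless of how many
    examples it contains.
    """
    return round(sum(values) / len(values), 1)


averages = {name: column_average(vals) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

For these three rows the computed values agree with the reported averages (53.4, 73.3, and 86.3), which suggests each split is weighted equally rather than by its number of examples.
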
GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding Model | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1-2B | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
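
Several grounding models appear in both the standard and the agent settings, so their averages can be compared directly. The sketch below computes the change in average score when moving from the standard setting to the agent setting (where GPT-4o is listed as the planner), using only numbers copied from the two sets of results. The helper name `setting_delta` and the row labels are illustrative and not part of the UGround codebase.

```python
# (standard-setting average, agent-setting average), copied from the results.
averages = {
    "SeeClick": (53.4, 52.4),
    "Qwen-GUI": (28.6, 38.5),
    "UGround-V1 (LLaVA)": (73.3, 81.4),
    "OS-Atlas-Base-7B": (81.0, 83.7),
    "UGround-V1-2B": (77.7, 81.5),
    "UGround-V1-7B": (86.3, 84.0),
}


def setting_delta(standard, agent):
    """Agent-setting average minus standard-setting average, in points."""
    return round(agent - standard, 1)


deltas = {name: setting_delta(std, agt) for name, (std, agt) in averages.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

A positive value means the model scores higher on average in the agent setting. The difference is a plain subtraction of reported averages and says nothing about statistical significance.
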

Citation Information
If you find this work useful, please consider citing our papers:
@article{gou2024uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2410.05243},
year={2024},
url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu