🚀 UGround-V1-7B (Based on Qwen2-VL)
UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between the OSU NLP Group and Orby AI.
- Homepage: https://osu-nlp-group.github.io/UGround/
- Repository: https://github.com/OSU-NLP-Group/UGround
- Paper (ICLR'25 Oral): https://arxiv.org/abs/2410.05243
- Demo: https://huggingface.co/spaces/orby-osu/UGround
- Point of contact: Boyu Gou
✨ Key Features
- Strong GUI visual grounding: performs well on multiple GUI visual grounding benchmarks such as ScreenSpot.
- Multiple model sizes: released in 2B, 7B, and 72B variants.
- Rich experimental support: covers both offline and online experiments, with inference code and results provided.
- Data synthesis pipeline: a data synthesis pipeline is coming soon to make data generation easier.
📦 Installation
See the official Qwen2-VL repository for more instructions on training and inference.
💻 Usage Examples
vLLM Server
vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
or
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
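Once the server is running, requests go through its OpenAI-compatible endpoint. A minimal client sketch (the base_url below assumes vLLM's default port 8000; adjust it to your deployment, and keep the api_key in sync with the one passed to `vllm serve`, if any):

from openai import AsyncOpenAI

# Point an OpenAI-compatible async client at the local vLLM server.
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)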
Visual Grounding Prompt
def format_openai_template(description: str, base64_image):
return [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
},
{
"type": "text",
"text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.
- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of the interest.
Description: {description}
Answer:"""
},
],
},
]
messages = format_openai_template(description, base64_image)
completion = await client.chat.completions.create(
model=args.model_path,
messages=messages,
temperature=0 # REMEMBER to set temperature to ZERO!
# REMEMBER to set temperature to ZERO!
# REMEMBER to set temperature to ZERO!
)
# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
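Putting the pieces together, here is a sketch of one end-to-end grounding call, including base64 encoding and rescaling the model's [0, 1000) output to pixel coordinates. The `ground` helper and the regex-based parsing are illustrative assumptions; they presume the model answers in the "(x, y)" format requested by the prompt, and reuse the `client` and `format_openai_template` defined above.

import asyncio
import base64
import re
from PIL import Image

async def ground(description: str, image_path: str):
    # Encode the screenshot and record its true size for rescaling.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    width, height = Image.open(image_path).size

    messages = format_openai_template(description, b64)
    completion = await client.chat.completions.create(
        model="osunlp/UGround-V1-7B",
        messages=messages,
        temperature=0,  # temperature must be 0
    )
    # The model answers with "(x, y)" in the [0, 1000) range.
    x, y = map(float, re.findall(r"-?\d+\.?\d*", completion.choices[0].message.content)[:2])
    # Convert to absolute pixel coordinates.
    return x / 1000 * width, y / 1000 * height

# Example: asyncio.run(ground("the search button", "screenshot.png"))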
📚 Documentation
Models
Release Plan
- [x] Model Weights
  - [x] Initial Version (the one used in the paper)
  - [x] Qwen2-VL-Based V1
    - [x] 2B
    - [x] 7B
    - [x] 72B
- [x] Code
  - [x] Inference code of UGround (Initial Version and Qwen2-VL-Based)
  - [x] Offline experiments (code, results, and useful resources)
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniAct
    - [x] AndroidControl
  - [x] Online experiments
    - [x] Mind2Web-Live-SeeAct-V
    - [x] AndroidWorld-SeeAct-V
- [ ] Data synthesis pipeline (coming soon)
- [x] Training data (V1)
- [x] Online demo (HF Spaces)
Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)
ScreenSpot (Standard) | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
---|---|---|---|---|---|---|---|---|---|
InternVL-2-4B | InternVL-2 | | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
Ferret-UI-Llama8b | Ferret-UI | | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
CogAgent | CogAgent | | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
OmniParser | | | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
UGround | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
Molmo-7B-D | | | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
Molmo-72B | | | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
Claude (Computer Use) | | | 98.2 | 85.6 | 79.9 | 57.1 | 92.2 | 84.5 | 82.9 |
Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
Project Mariner | | | | | | | | | 84.0 |
UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | 85.7 | 90.4 | 87.9 | 89.4 |
GUI Visual Grounding: ScreenSpot (Agent Setting)
Planner | Agent-ScreenSpot | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
GPT-4o | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
🔧 Technical Details
No further technical details are provided at this time.
📄 License
This project is released under the Apache-2.0 license.
Citation
If you find this work useful, please consider citing our papers:
@article{gou2024uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2410.05243},
year={2024},
url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2401.01614},
year={2024},
}
Qwen2-VL-7B-Instruct
Introduction
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
What's new in Qwen2-VL?
Key Enhancements:
- State-of-the-art understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, and more.
- Understanding videos of 20 minutes and longer: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, and more.
- Agent that can operate mobile phones, robots, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.
- Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more.
Model Architecture Updates:
- Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens and offering a more human-like visual processing experience.
We have three models with 2, 7, and 72 billion parameters. This repository contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.
Evaluation
Image Benchmarks
Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
---|---|---|---|---|
MMMU_val | 51.8 | 49.8 | 60 | 54.1 |
DocVQA_test | 91.6 | 90.8 | - | 94.5 |
InfoVQA_test | 74.8 | - | - | 76.5 |
ChartQA_test | 83.3 | - | - | 83.0 |
TextVQA_val | 77.4 | 80.1 | - | 84.3 |
OCRBench | 794 | 852 | 785 | 845 |
MTVQA | - | - | - | 26.3 |
VCR_en easy | - | 73.88 | 83.60 | 89.70 |
VCR_zh easy | - | 10.18 | 1.10 | 59.94 |
RealWorldQA | 64.4 | - | - | 70.1 |
MME_sum | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
MMBench-EN_test | 81.7 | - | - | 83.0 |
MMBench-CN_test | 81.2 | - | - | 80.5 |
MMBench-V1.1_test | 79.4 | 78.0 | 76.0 | 80.7 |
MMT-Bench_test | - | - | - | 63.7 |
MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 |
HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 |
MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 |
MathVision | - | - | - | 16.3 |
Video Benchmarks
Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
---|---|---|---|---|
MVBench | 66.4 | 56.7 | - | 67.0 |
PerceptionTest_test | - | 57.1 | - | 62.3 |
EgoSchema_test | - | 60.1 | - | 66.7 |
Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |
Requirements
The code for Qwen2-VL has been merged into the latest Hugging Face transformers, and we advise you to build it from source with the following command:
pip install git+https://github.com/huggingface/transformers
Otherwise, you might encounter the following error:
KeyError: 'qwen2_vl'
Quickstart
We offer a toolkit to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:
pip install qwen-vl-utils
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2-VL-7B-Instruct",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# The default range for the number of visual tokens per image is 4-16384. You can set min_pixels and max_pixels as needed, e.g. a token range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Without qwen_vl_utils
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")
# Inference: generate the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
output_ids[len(input_ids) :]
for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
Multi-image inference
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Video inference
# Messages containing a list of images as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Batch inference
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine the two message lists for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
# You can directly insert a local file path, a URL, or a base64-encoded image wherever you want in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64-encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
]
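For the base64 case, a small helper can build the data URI accepted by the "image" field from a local file. This is a sketch; the to_data_uri helper is hypothetical and simply mirrors the "data:image;base64," format shown above.

import base64

def to_data_uri(path: str) -> str:
    # Read a local image and wrap it as a base64 data URI.
    with open(path, "rb") as f:
        return "data:image;base64," + base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": to_data_uri("/path/to/your/image.jpg")},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]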
Image Resolution for Performance Boost
The model supports a wide range of resolution inputs. By default, it uses the native resolution of the input, but higher resolutions can enhance performance at the cost of more computation. Users can set a minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
Besides, we provide two methods for fine-grained control over the image size fed to the model (a rough worked example of the resulting token budget follows this list):
- Define min_pixels and max_pixels: images will be resized to keep their aspect ratio within the range of min_pixels and max_pixels.
- Specify exact dimensions: directly set resized_height and resized_width. These values will be rounded to the nearest multiple of 28.
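As a rough worked example of how these settings relate to the visual token budget: a sketch assuming one visual token per 28x28-pixel patch, which is what the 256*28*28-style defaults above imply; the exact resizing done by the processor may differ slightly.

def approx_visual_tokens(height: int, width: int) -> int:
    # Round each side to the nearest multiple of 28, then count 28x28 patches.
    h = round(height / 28) * 28
    w = round(width / 28) * 28
    return (h // 28) * (w // 28)

# resized_height=280, resized_width=420 -> 10 * 15 = 150 visual tokens
print(approx_visual_tokens(280, 420))
# min_pixels = 256*28*28 therefore corresponds to a floor of roughly 256 tokens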
# resized_height and resized_width
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"resized_height": 280,
"resized_width": 420,
},
{"type": "text", "text": "Describe this image."},
],
}
]
# min_pixels and max_pixels
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"min_pixels": 50176,
"max_pixels": 50176,
},
{"type": "text", "text": "Describe this image."},
],
}
]
Limitations
While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
- Lack of audio support: the current model cannot comprehend audio information within videos.
- Data timeliness: our image dataset is updated up to June 2023, and information after this date may not be covered.
- Constraints on recognizing individuals and intellectual property: the model's ability to recognize specific individuals or IPs is limited and may not comprehensively cover all well-known personalities or brands.
- Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution capabilities need improvement.
- Insufficient counting accuracy: particularly in complex scenes, object counting is not highly accurate and requires further improvement.
- Weak spatial reasoning: especially in 3D space, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
Citation
If you find our work helpful, feel free to cite our papers:
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}








