🚀 UGround-V1-7B (Based on Qwen2-VL)
UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between the OSU NLP Group and Orby AI.
- Homepage: https://osu-nlp-group.github.io/UGround/
- Repository: https://github.com/OSU-NLP-Group/UGround
- Paper (ICLR'25 Oral): https://arxiv.org/abs/2410.05243
- Demo: https://huggingface.co/spaces/orby-osu/UGround
- Point of contact: Boyu Gou
✨ Key Features
- Strong GUI visual grounding: performs well on multiple GUI visual grounding benchmarks such as ScreenSpot.
- Multiple model sizes: released in 2B, 7B, and 72B variants.
- Rich experimental support: covers both offline and online experiments, with inference code and results provided.
- Data synthesis pipeline: a data synthesis pipeline is coming soon to make data generation easier.
📦 Installation
See the official Qwen2-VL repository for more instructions on training and inference.
💻 Usage Examples
vLLM Server
vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
or
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
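Once the server is running, requests go through its OpenAI-compatible endpoint. A minimal client sketch (the base_url below assumes vLLM's default port 8000; adjust it to your deployment, and keep the api_key in sync with the one passed to `vllm serve`, if any):

from openai import AsyncOpenAI

# Point an OpenAI-compatible async client at the local vLLM server.
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)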
Visual Grounding Prompt
def format_openai_template(description: str, base64_image):
return [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
},
{
"type": "text",
"text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.
- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of the interest.
Description: {description}
Answer:"""
},
],
},
]
messages = format_openai_template(description, base64_image)
completion = await client.chat.completions.create(
model=args.model_path,
messages=messages,
temperature=0 # REMEMBER to set temperature to ZERO!
# REMEMBER to set temperature to ZERO!
# REMEMBER to set temperature to ZERO!
)
# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
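Putting the pieces together, here is a sketch of one end-to-end grounding call, including base64 encoding and rescaling the model's [0, 1000) output to pixel coordinates. The `ground` helper and the regex-based parsing are illustrative assumptions; they presume the model answers in the "(x, y)" format requested by the prompt, and reuse the `client` and `format_openai_template` defined above.

import asyncio
import base64
import re
from PIL import Image

async def ground(description: str, image_path: str):
    # Encode the screenshot and record its true size for rescaling.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    width, height = Image.open(image_path).size

    messages = format_openai_template(description, b64)
    completion = await client.chat.completions.create(
        model="osunlp/UGround-V1-7B",
        messages=messages,
        temperature=0,  # temperature must be 0
    )
    # The model answers with "(x, y)" in the [0, 1000) range.
    x, y = map(float, re.findall(r"-?\d+\.?\d*", completion.choices[0].message.content)[:2])
    # Convert to absolute pixel coordinates.
    return x / 1000 * width, y / 1000 * height

# Example: asyncio.run(ground("the search button", "screenshot.png"))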
📚 Documentation
Models
Release Plan
- [x] Model Weights
  - [x] Initial Version (the one used in the paper)
  - [x] Qwen2-VL-Based V1
    - [x] 2B
    - [x] 7B
    - [x] 72B
- [x] Code
  - [x] Inference code of UGround (Initial Version and Qwen2-VL-Based)
  - [x] Offline experiments (code, results, and useful resources)
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniAct
    - [x] AndroidControl
  - [x] Online experiments
    - [x] Mind2Web-Live-SeeAct-V
    - [x] AndroidWorld-SeeAct-V
- [ ] Data synthesis pipeline (coming soon)
- [x] Training data (V1)
- [x] Online demo (HF Spaces)
Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)
ScreenSpot (Standard) | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
---|---|---|---|---|---|---|---|---|---|
InternVL-2-4B | InternVL-2 | | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
Ferret-UI-Llama8b | Ferret-UI | | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
CogAgent | CogAgent | | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
OmniParser | | | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
UGround | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
Molmo-7B-D | | | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
Molmo-72B | | | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
Claude (Computer Use) | | | 98.2 | 85.6 | 79.9 | 57.1 | 92.2 | 84.5 | 82.9 |
Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
Project Mariner | | | | | | | | | 84.0 |
UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | 85.7 | 90.4 | 87.9 | 89.4 |
GUI Visual Grounding: ScreenSpot (Agent Setting)
Planner | Agent-ScreenSpot | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
GPT-4o | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
🔧 Technical Details
No further technical details are provided at this time.
📄 License
This project is released under the Apache-2.0 license.
Citation
If you find this work useful, please consider citing our papers:
@article{gou2024uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2410.05243},
year={2024},
url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2401.01614},
year={2024},
}
Qwen2-VL-7B-Instruct
Introduction
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
What's new in Qwen2-VL?
Key Enhancements:
- State-of-the-art understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, and more.
- Understanding videos of 20 minutes and longer: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, and more.
- Agent that can operate mobile phones, robots, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.
- Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more.
Model Architecture Updates:
- Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens and offering a more human-like visual processing experience.
We have three models with 2, 7, and 72 billion parameters. This repository contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.
Evaluation
Image Benchmarks
Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
---|---|---|---|---|
MMMU_val | 51.8 | 49.8 | 60 | 54.1 |
DocVQA_test | 91.6 | 90.8 | - | 94.5 |
InfoVQA_test | 74.8 | - | - | 76.5 |
ChartQA_test | 83.3 | - | - | 83.0 |
TextVQA_val | 77.4 | 80.1 | - | 84.3 |
OCRBench | 794 | 852 | 785 | 845 |
MTVQA | - | - | - | 26.3 |
VCR_en easy | - | 73.88 | 83.60 | 89.70 |
VCR_zh easy | - | 10.18 | 1.10 | 59.94 |
RealWorldQA | 64.4 | - | - | 70.1 |
MME_sum | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
MMBench-EN_test | 81.7 | - | - | 83.0 |
MMBench-CN_test | 81.2 | - | - | 80.5 |
MMBench-V1.1_test | 79.4 | 78.0 | 76.0 | 80.7 |
MMT-Bench_test | - | - | - | 63.7 |
MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 |
HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 |
MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 |
MathVision | - | - | - | 16.3 |
Video Benchmarks
Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
---|---|---|---|---|
MVBench | 66.4 | 56.7 | - | 67.0 |
PerceptionTest_test | - | 57.1 | - | 62.3 |
EgoSchema_test | - | 60.1 | - | 66.7 |
Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |
Requirements
The code for Qwen2-VL has been merged into the latest Hugging Face transformers, and we advise you to build it from source with the following command:
pip install git+https://github.com/huggingface/transformers
Otherwise, you might encounter the following error:
KeyError: 'qwen2_vl'
Quickstart
We offer a toolkit to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:
pip install qwen-vl-utils
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2-VL-7B-Instruct",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# The default range for the number of visual tokens per image is 4-16384. You can set min_pixels and max_pixels as needed, e.g. a token range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Without qwen_vl_utils
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")
# Inference: generate the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
output_ids[len(input_ids) :]
for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
Multi-image inference
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Video inference
# Messages containing a list of images as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Batch inference
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine the two message lists for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
# You can directly insert a local file path, a URL, or a base64-encoded image wherever you want in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64-encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
]
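For the base64 case, a small helper can build the data URI accepted by the "image" field from a local file. This is a sketch; the to_data_uri helper is hypothetical and simply mirrors the "data:image;base64," format shown above.

import base64

def to_data_uri(path: str) -> str:
    # Read a local image and wrap it as a base64 data URI.
    with open(path, "rb") as f:
        return "data:image;base64," + base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": to_data_uri("/path/to/your/image.jpg")},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]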
Image Resolution for Performance Boost
The model supports a wide range of resolution inputs. By default, it uses the native resolution of the input, but higher resolutions can enhance performance at the cost of more computation. Users can set a minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
Besides, we provide two methods for fine-grained control over the image size fed to the model (a rough worked example of the resulting token budget follows this list):
- Define min_pixels and max_pixels: images will be resized to keep their aspect ratio within the range of min_pixels and max_pixels.
- Specify exact dimensions: directly set resized_height and resized_width. These values will be rounded to the nearest multiple of 28.
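As a rough worked example of how these settings relate to the visual token budget: a sketch assuming one visual token per 28x28-pixel patch, which is what the 256*28*28-style defaults above imply; the exact resizing done by the processor may differ slightly.

def approx_visual_tokens(height: int, width: int) -> int:
    # Round each side to the nearest multiple of 28, then count 28x28 patches.
    h = round(height / 28) * 28
    w = round(width / 28) * 28
    return (h // 28) * (w // 28)

# resized_height=280, resized_width=420 -> 10 * 15 = 150 visual tokens
print(approx_visual_tokens(280, 420))
# min_pixels = 256*28*28 therefore corresponds to a floor of roughly 256 tokens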
# resized_height and resized_width
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"resized_height": 280,
"resized_width": 420,
},
{"type": "text", "text": "Describe this image."},
],
}
]
# min_pixels and max_pixels
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"min_pixels": 50176,
"max_pixels": 50176,
},
{"type": "text", "text": "Describe this image."},
],
}
]
Limitations
While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
- Lack of audio support: the current model cannot comprehend audio information within videos.
- Data timeliness: our image dataset is updated up to June 2023, and information after this date may not be covered.
- Constraints on recognizing individuals and intellectual property: the model's ability to recognize specific individuals or IPs is limited and may not comprehensively cover all well-known personalities or brands.
- Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution capabilities need improvement.
- Insufficient counting accuracy: particularly in complex scenes, object counting is not highly accurate and requires further improvement.
- Weak spatial reasoning: especially in 3D space, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
Citation
If you find our work helpful, feel free to cite our papers:
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}








