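UGround-V1-72B: Open-Source Visual Grounding Model, Free to Use for Image-Text-to-Text Multimodal Tasks

UGround V1 72B

Developed by osunlp

UGround is a strong GUI visual grounding model trained with a simple recipe, aimed at image-text-to-text multimodal tasks.

Task: Image-to-Text

Library: Transformers

Language: English · License: other · Tags: #MultimodalGUIGrounding #VisualInstructionUnderstanding #CrossPlatformControl

Downloads: 129

Released: 1/11/2025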

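Model Overview

UGround is a visual grounding model developed jointly by OSUNLP and Orby AI. It is built on the Qwen2-VL architecture and handles multimodal tasks that combine images and text.

Model Features

Strong GUI visual grounding

UGround locates elements in graphical user interfaces from a text description of the target.

Multimodal support

The model takes an image together with text as input and answers in text.

Built on Qwen2-VL

This checkpoint uses the Qwen2-VL-72B architecture.

Model Capabilities

Image-text interaction

GUI element grounding

Multimodal task processing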

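Use Cases

GUI automation

Screen element grounding

Locates on-screen GUI elements so that automated tests can act on them.

Improves the accuracy and efficiency of automated testing.

Multimodal interaction

Image description generation

Generates detailed text descriptions from image content.

Improves the quality of image understanding and description.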

🚀 UGround-V1-72B (Qwen2-VL-Based) (no LoRA)

UGround is a strong GUI visual grounding model trained with a simple recipe. See our homepage and paper for more details. This work is a collaboration between OSUNLP and Orby AI.

Homepage: https://osu-nlp-group.github.io/UGround/
Repository: https://github.com/OSU-NLP-Group/UGround
Paper: https://arxiv.org/abs/2410.05243
Demo: https://huggingface.co/spaces/orby-osu/UGround
Point of contact: Boyu Gou

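✨ Key Features

Models

Model-V1:

Release Plan

- [x] Model weights
  - [x] Initial version (the one used in the paper)
  - [x] Qwen2-VL-Based V1
    - [x] 2B
    - [x] 7B
    - [x] 72B
- [x] Code
  - [x] Inference code of UGround (initial and Qwen2-VL-Based)
  - [x] Offline experiments (code, results, and useful resources)
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniAct
    - [x] AndroidControl
  - [x] Online experiments
    - [x] Mind2Web-Live-SeeAct-V
    - [x] AndroidWorld-SeeAct-V
  - [ ] Data synthesis pipeline (coming soon)
- [x] Training data (V1)
- [x] Online demo (HF Spaces)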

Main Results

GUI Visual Grounding: ScreenSpot (Standard Setting)

ScreenSpot (Standard)	Arch	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Avg
InternVL-2-4B	InternVL-2		9.2	4.8	4.6	4.3	0.9	0.1	4.0
Groma	Groma		10.3	2.6	4.6	4.3	5.7	3.4	5.2
Qwen-VL	Qwen-VL		9.5	4.8	5.7	5.0	3.5	2.4	5.2
MiniGPT-v2	MiniGPT-v2		8.4	6.6	6.2	2.9	6.5	3.4	5.7
GPT-4			22.6	24.5	20.2	11.8	9.2	8.8	16.2
GPT-4o			20.2	24.9	21.1	23.6	12.2	7.8	18.3
Fuyu	Fuyu		41.0	1.3	33.0	3.6	33.9	4.4	19.5
Qwen-GUI	Qwen-VL	GUICourse	52.4	10.9	45.9	5.7	43.0	13.6	28.6
Ferret-UI-Llama8b	Ferret-UI		64.5	32.3	45.9	11.4	28.3	11.7	32.3
Qwen2-VL	Qwen2-VL		61.3	39.3	52.0	45.0	33.0	21.8	42.1
CogAgent	CogAgent		67.0	24.0	74.2	20.0	70.4	28.6	47.4
SeeClick	Qwen-VL	SeeClick	78.0	52.0	72.2	30.0	55.7	32.5	53.4
OS-Atlas-Base-4B	InternVL-2	OS-Atlas	85.7	58.5	72.2	45.7	82.6	63.1	68.0
OmniParser			93.9	57.0	91.3	63.6	81.3	51.0	73.0
UGround	LLaVA-UGround-V1	UGround-V1	82.8	60.3	82.5	63.6	80.4	70.4	73.3
Iris	Iris	SeeClick	85.3	64.2	86.7	57.5	82.6	71.2	74.6
ShowUI-G	ShowUI	ShowUI	91.6	69.0	81.8	59.0	83.0	65.5	75.0
ShowUI	ShowUI	ShowUI	92.3	75.5	76.3	61.1	81.7	63.6	75.1
Molmo-7B-D			85.4	69.0	79.4	70.7	81.3	65.5	75.2
UGround-V1-2B (Qwen2-VL-Based)	Qwen2-VL	UGround-V1	89.4	72.0	88.7	65.7	81.3	68.9	77.7
Molmo-72B			92.7	79.5	86.1	64.3	83.0	66.0	78.6
Aguvis-G-7B	Qwen2-VL	Aguvis-Stage-1	88.3	78.2	88.1	70.7	85.7	74.8	81.0
OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.0	72.9	91.8	62.9	90.9	74.3	81.0
Aria-UI	Aria	Aria-UI	92.3	73.8	93.3	64.3	86.5	76.2	81.1
Claude (Computer Use)			98.2	85.6	79.9	57.1	92.2	84.5	82.9
Aguvis-7B	Qwen2-VL	Aguvis-Stage-1&2	95.6	77.7	93.8	67.1	88.3	75.2	83.0
Project Mariner									84.0
UGround-V1-7B (Qwen2-VL-Based)	Qwen2-VL	UGround-V1	93.0	79.9	93.8	76.4	90.9	84.0	86.3
AGUVIS-72B	Qwen2-VL	Aguvis-Stage-1&2	94.5	85.2	95.4	77.9	91.3	85.9	88.4
UGround-V1-72B (Qwen2-VL-Based)	Qwen2-VL	UGround-V1	94.1	83.4	94.9	85.7	90.4	87.9	89.4
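
The Avg column above appears to be the unweighted mean of the six split scores. That is a reading of the numbers, not something the source states. A minimal check on the three Qwen2-VL-based UGround-V1 rows (the helper name `macro_average` is hypothetical):

```python
# Split scores copied from the three Qwen2-VL-based UGround-V1 rows above, in
# column order: mobile text, mobile icon, desktop text, desktop icon,
# web text, web icon. The second item is the reported average.
ROWS = {
    "UGround-V1-2B": ([89.4, 72.0, 88.7, 65.7, 81.3, 68.9], 77.7),
    "UGround-V1-7B": ([93.0, 79.9, 93.8, 76.4, 90.9, 84.0], 86.3),
    "UGround-V1-72B": ([94.1, 83.4, 94.9, 85.7, 90.4, 87.9], 89.4),
}


def macro_average(scores: list[float]) -> float:
    """Unweighted mean of the per-split scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


for name, (scores, reported) in ROWS.items():
    print(name, macro_average(scores), reported)
```

For these three rows the recomputed mean matches the reported average to one decimal place.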

GUI Visual Grounding: ScreenSpot (Agent Setting)

Planner	Agent-ScreenSpot	Arch	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Avg
GPT-4o	Qwen-VL	Qwen-VL		21.3	21.4	18.6	10.7	9.1	5.8	14.5
GPT-4o	Qwen-GUI	Qwen-VL	GUICourse	67.8	24.5	53.1	16.4	50.4	18.5	38.5
GPT-4o	SeeClick	Qwen-VL	SeeClick	81.0	59.8	69.6	33.6	43.9	26.2	52.4
GPT-4o	OS-Atlas-Base-4B	InternVL-2	OS-Atlas	94.1	73.8	77.8	47.1	86.5	65.3	74.1
GPT-4o	OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.8	79.9	90.2	66.4	92.6	79.1	83.7
GPT-4o	UGround-V1	LLaVA-UGround-V1	UGround-V1	93.4	76.9	92.8	67.9	88.7	68.9	81.4
GPT-4o	UGround-V1-2B (Qwen2-VL-Based)	Qwen2-VL	UGround-V1	94.1	77.7	92.8	63.6	90.0	70.9	81.5
GPT-4o	UGround-V1-7B (Qwen2-VL-Based)	Qwen2-VL	UGround-V1	94.1	79.9	93.3	73.6	89.6	73.3	84.0

📦 Installation

vLLM server

vllm serve osunlp/UGround-V1-7B  --api-key token-abc123 --dtype float16

or

python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16

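You can find more instructions on training and inference in the official Qwen2-VL repository. The commands above name the 7B checkpoint; to serve the 72B model described here, substitute its model ID (osunlp/UGround-V1-72B), which will likely need several GPUs (for example via vLLM's --tensor-parallel-size option).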

💻 Usage Examples

Visual Grounding Prompt

import asyncio

from openai import AsyncOpenAI

# Assumption: the vLLM server started above listens on vLLM's default port (8000)
# and was launched with --api-key token-abc123.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")


def format_openai_template(description: str, base64_image: str):
    """Build the chat messages: the screenshot followed by the grounding prompt."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
  - Your answer should be a single string (x, y) corresponding to the point of the interest.

  Description: {description}

  Answer:"""
                },
            ],
        },
    ]


async def ground(description: str, base64_image: str, model: str = "osunlp/UGround-V1-7B") -> str:
    """Ask the served model for the point that matches `description`."""
    messages = format_openai_template(description, base64_image)
    completion = await client.chat.completions.create(
        model=model,  # must match the model name served by vLLM
        messages=messages,
        temperature=0,  # REMEMBER to set temperature to ZERO!
    )
    return completion.choices[0].message.content


# Example: answer = asyncio.run(ground("the search button", base64_image))
# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
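
The rescaling described in the last two comments can be sketched as follows. This is a minimal illustration, assuming the model answers with a plain "(x, y)" string as the prompt requests; `parse_point` and `to_pixels` are hypothetical helper names, not part of the UGround code.

```python
import re


def parse_point(answer: str) -> tuple[float, float]:
    """Extract the first "(x, y)" pair from the model's text answer."""
    match = re.search(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", answer)
    if match is None:
        raise ValueError(f"no (x, y) point found in {answer!r}")
    return float(match.group(1)), float(match.group(2))


def to_pixels(answer: str, width: int, height: int) -> tuple[int, int]:
    """Map a point from the model's [0, 1000) space to pixel coordinates."""
    x, y = parse_point(answer)
    # Same as (x / 1000 * width, y / 1000 * height), truncated to whole pixels.
    return int(x * width / 1000), int(y * height / 1000)


print(to_pixels("(500, 250)", width=1920, height=1080))  # (960, 270)
```

Here `width` and `height` are the size of the screenshot that was sent to the model, so the returned point can be passed directly to a click action on that screenshot's coordinate system.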

📚 Documentation

Citation Information

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }

@article{zheng2023seeact,
        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},
        year={2024},
      }

🚀 Qwen2-VL-72B-Instruct

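Introduction

We are excited to introduce Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

What's New in Qwen2-VL

Key Enhancements

State-of-the-art understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20+ minutes: Qwen2-VL can understand videos longer than 20 minutes for high-quality video-based question answering, dialog, and content creation.
Agent that can operate mobile phones, robots, and other devices: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots and operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now understands text in other languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Model Architecture Updates

Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, which gives a more human-like visual processing experience.

Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.

We have three models with 2, 8, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2-VL model. For more information, visit our blog and GitHub.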

Evaluation

Image Benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Claude-3.5 Sonnet	GPT-4o	Qwen2-VL-72B
MMMU (val)	58.3	68.3	69.1	64.5
DocVQA (test)	94.1	95.2	92.8	96.5
InfoVQA (test)	82.0	-	-	84.5
ChartQA (test)	88.4	90.8	85.7	88.3
TextVQA (val)	84.4	-	-	85.5
OCRBench	852	788	736	877
MTVQA	17.3	25.7	27.8	30.9
VCR (en easy)	84.67	63.85	91.55	91.93
VCR (zh easy)	22.09	1.0	14.87	65.37
RealWorldQA	72.2	60.1	75.4	77.8
MME (sum)	2414.7	1920.0	2328.7	2482.7
MMBench-EN (test)	86.5	79.7	83.4	86.5
MMBench-CN (test)	86.3	80.7	82.1	86.6
MMBench-V1.1 (test)	85.5	78.5	82.2	85.9
MMT-Bench (test)	63.4	-	65.5	71.7
MMStar	67.1	62.2	63.9	68.3
MMVet (GPT-4-Turbo)	65.7	66.0	69.1	74.0
HallBench (avg)	55.2	49.9	55.0	58.1
MathVista (testmini)	67.5	67.7	63.8	70.5
MathVision	16.97	-	30.4	25.9

Video Benchmarks

Benchmark	Previous SoTA (open-source LVLM)	Gemini 1.5-Pro	GPT-4o	Qwen2-VL-72B
MVBench	69.6	-	-	73.6
PerceptionTest (test)	66.9	-	-	68.0
EgoSchema (test)	62.0	63.2	72.2	77.9
Video-MME (without/with subtitles)	66.3/69.6	75.0/81.3	71.9/77.2	71.2/77.8

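Agent Benchmarks

	Benchmark	Metric	Previous SoTA	GPT-4o	Qwen2-VL-72B
General	FnCall [1]	TM	-	90.2	93.1
		EM	-	50.0	53.2
Game	Number Line	SR	89.4 [2]	91.5	100.0
	BlackJack	SR	40.2 [2]	34.5	42.6
	EZPoint	SR	50.0 [2]	85.5	100.0
	Point24	SR	2.6 [2]	3.0	4.5
Android	AITZ	TM	83.0 [3]	70.0	89.6
		EM	47.7 [3]	35.3	72.1
AI2THOR	ALFRED (valid-unseen)	SR	67.7 [4]	-	67.8
		GC	75.3 [4]	-	75.8
Vision-and-Language Navigation	R2R (valid-unseen)	SR	79.0	43.7 [5]	51.7
	REVERIE (valid-unseen)	SR	61.0	31.6 [5]	31.0

SR, GC, TM, and EM stand for success rate, goal-condition success, type match, and exact match, respectively. ALFRED is supported by SAM [6].

[1] Self-Curated Function Call Benchmark by the Qwen Team
[2] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[3] Android in the Zoo: Chain-of-Action-Thought for GUI Agents
[4] ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
[5] MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
[6] Segment Anything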

Multilingual Benchmarks

Model	AR	DE	FR	IT	JA	KO	RU	TH	VI	Avg
Qwen2-VL-72B	20.7	36.5	44.1	42.8	21.6	37.4	15.6	17.7	41.6	30.9
GPT-4o	20.2	34.2	41.2	32.7	20.0	33.9	11.5	22.5	34.2	27.8
Claude3 Opus	15.1	33.4	40.6	34.4	19.4	27.2	13.0	19.5	29.1	25.7
Gemini Ultra	14.7	32.3	40.0	31.8	12.3	17.2	11.8	20.3	28.6	23.2

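Requirements

The code of Qwen2-VL has been merged into the latest Hugging Face transformers. We advise you to build it from source with the following command:

pip install git+https://github.com/huggingface/transformers

Otherwise, you may encounter the following error:

KeyError: 'qwen2_vl'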
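
One way to tell in advance whether the installed transformers is recent enough is to look for the `Qwen2VLForConditionalGeneration` class that the snippets in this document import. A minimal sketch, where `has_qwen2_vl` is a hypothetical helper name and the check is only a heuristic:

```python
import importlib


def has_qwen2_vl() -> bool:
    """Return True if the installed transformers exposes the Qwen2-VL model class."""
    try:
        transformers = importlib.import_module("transformers")
        return hasattr(transformers, "Qwen2VLForConditionalGeneration")
    except Exception:  # transformers is missing, or its lazy import failed
        return False


if not has_qwen2_vl():
    print("Qwen2-VL is not available: install transformers from source first.")
```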

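Quickstart

We provide a toolkit that makes it easier to handle various types of visual input, including base64, URLs, and interleaved images and videos. Install it with:

pip install qwen-vl-utils

The following snippet shows how to use the chat model with transformers and qwen_vl_utils: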

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

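# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# (The commented-out variant below also needs `import torch`.)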
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

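# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# The default range for the number of visual tokens per image is 4-16384. You can set min_pixels and max_pixels
# according to your needs, for example a token range of 256-1280, to balance speed and memory usage.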
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
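
The `min_pixels` and `max_pixels` values in the commented-out lines above are a token count multiplied by 28*28. The sketch below makes that arithmetic explicit. Treating one visual token as a 28x28 pixel area is a reading of those lines, and `approx_visual_tokens` is only a rough estimate under that assumption; both function names are hypothetical.

```python
PATCH_PIXELS = 28 * 28  # pixels per visual token, as implied by 256*28*28


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PATCH_PIXELS, max_tokens * PATCH_PIXELS


def approx_visual_tokens(width: int, height: int) -> int:
    """Rough token estimate for an image before any min/max clamping (assumption)."""
    return max(1, round(width / 28)) * max(1, round(height / 28))


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

Under this estimate a 1920x1080 screenshot would use about 2691 tokens, so a 1280-token cap would cause it to be downscaled before encoding.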

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query (this replaces the frame-list version above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

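Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:

Lack of audio support: the current model does not comprehend audio information within videos.
Data timeliness: our image dataset is updated until June 2023, and information after this date may not be covered.
Constraints in recognizing individuals and intellectual property (IP): the model's capacity to recognize specific individuals or IPs is limited, and it may fail to cover all well-known personalities or brands.
Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution capabilities need improvement.
Insufficient counting accuracy: particularly in complex scenes, the accuracy of object counting is not high and needs further improvement.
Weak spatial reasoning: especially in 3D spaces, the model's inference of positional relationships between objects is inadequate, making it difficult to judge the relative positions of objects precisely.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.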

Citation

If you find our work helpful, feel free to cite us.

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}