Ferret-UI-Gemma2b: an open-source multimodal large language model for UI referring, grounding, and reasoning tasks


Ferret UI Gemma2b

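Developed by jadechoghari

Ferret-UI is the first multimodal large language model focused on user interfaces. This version is built on Gemma-2B and is designed for UI referring, grounding, and reasoning tasks.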

Image-to-Text

Transformers

#UI multimodal understanding #UI element grounding #screen content reasoning

Downloads: 302

Release date: 10/9/2024

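Model overview

Ferret-UI is a multimodal large language model for understanding and analyzing user interfaces (UI). It can carry out complex UI tasks such as referring, grounding, and reasoning.

Model features

UI-specific multimodal model

The first multimodal large language model focused on user interface understanding

Precise grounding

Locates UI elements and returns their bounding-box coordinates

Complex reasoning

Performs complex UI-related reasoning tasks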

Model capabilities

UI element recognition

UI element grounding

UI screen description

UI element interaction analysis

UI layout understanding

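Use cases

Mobile app interface analysis

App interface element recognition

Identifies and describes the elements in a mobile app interface

Accurately identifies UI components such as buttons and text areas

Interface navigation analysis

Analyzes the navigation structure and flow of an app interface

Understands the transitions between screens and the user's operation paths

UI automated testing

UI element verification

Verifies the presence and position of UI elements

Ensures interface elements are rendered according to design specifications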

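🚀 Ferret-UI (Gemma-2B version)

Ferret-UI is the first UI-centric multimodal large language model (MLLM), designed for referring, grounding, and reasoning tasks. Built on Gemma-2B and Llama-3-8B, it can carry out complex UI tasks. This is the Gemma-2B version of Ferret-UI; it is inspired by a paper from Apple.

🚀 Quick start

📦 Installation

First download builder.py, conversation.py, inference.py, model_UI.py, and mm_utils.py to your local machine.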

wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/mm_utils.py
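
Where wget is not available, the same five files can be fetched from Python. The following is a minimal sketch using only the standard library; the base URL and file names come from the commands above, while the helper names helper_url and download_helpers are hypothetical and not part of the repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names are copied from the wget commands above.
BASE_URL = "https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main"
HELPER_FILES = [
    "conversation.py",
    "builder.py",
    "inference.py",
    "model_UI.py",
    "mm_utils.py",
]


def helper_url(filename: str) -> str:
    """Return the raw download URL for one helper file (hypothetical helper)."""
    return f"{BASE_URL}/{filename}"


def download_helpers(target_dir: str = ".") -> list:
    """Download every helper file into target_dir, skipping files already present."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in HELPER_FILES:
        dest = out_dir / name
        if not dest.exists():
            urlretrieve(helper_url(name), dest)
        paths.append(dest)
    return paths


# Call download_helpers() once before importing inference.py;
# it needs network access, so it is not invoked here.
```

The files are placed in the current directory by default so that "from inference import inference_and_run" resolves without changing the import path.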

💻 Usage examples

Basic usage

from inference import inference_and_run
image_path = "appstore_reminders.png"
prompt = "Describe the image in details"

# Call the function without a box
inference_text = inference_and_run(image_path, prompt, conv_mode="ferret_gemma_instruct", model_path="jadechoghari/Ferret-UI-Gemma2b")

# Output processed text
print("Inference Text:", inference_text)

Advanced usage

from inference import inference_and_run

# Task with a bounding box: ask about a specific region of the image
image_path = "appstore_reminders.png"
prompt = "What's inside the selected region?"
box = [189, 906, 404, 970]

inference_text = inference_and_run(
    image_path=image_path,
    prompt=prompt,
    conv_mode="ferret_gemma_instruct",
    model_path="jadechoghari/Ferret-UI-Gemma2b",
    box=box,
)
# You can also pass process_image=True to get the processed image back as well:
# processed_image, inference_text = inference_and_run(..., process_image=True)

print("Inference Text:", inference_text)
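
A malformed region is easier to catch before the model is loaded than after. The sketch below checks a box before it is passed as the box argument. It assumes the box is [x1, y1, x2, y2] in pixel coordinates of the input image with (x1, y1) as the top-left corner; the page does not state this explicitly. The function validate_box and the image size in the usage note are hypothetical.

```python
def validate_box(box, image_width, image_height):
    """Check a region box and return it as a tuple of ints.

    Assumption (not confirmed by the model card): the box is
    [x1, y1, x2, y2] in pixels, with (x1, y1) the top-left corner.
    """
    if len(box) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(box)}")
    x1, y1, x2, y2 = (int(v) for v in box)
    # The left edge must be strictly left of the right edge, inside the image.
    if not (0 <= x1 < x2 <= image_width):
        raise ValueError(f"x range {x1}..{x2} is outside 0..{image_width}")
    # The top edge must be strictly above the bottom edge, inside the image.
    if not (0 <= y1 < y2 <= image_height):
        raise ValueError(f"y range {y1}..{y2} is outside 0..{image_height}")
    return (x1, y1, x2, y2)
```

For a screenshot that is, say, 1170 by 2532 pixels (a made-up size), validate_box([189, 906, 404, 970], 1170, 2532) returns the box unchanged, while a box with swapped corners raises ValueError.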

Grounding prompts

# GROUNDING PROMPTS
GROUNDING_TEMPLATES = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nInclude the coordinates for each mentioned object.',
    '\nLocate the objects with their coordinates.',
    '\nAnswer in [x1, y1, x2, y2] format.',
    '\nMention the objects and their locations using the format [x1, y1, x2, y2].',
    '\nDraw boxes around the mentioned objects.',
    '\nUse boxes to show where each thing is.',
    '\nTell me where the objects are with coordinates.',
    '\nList where each object is with boxes.',
    '\nShow me the regions with boxes.'
]
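
Each template above is a suffix that starts with a newline, so it is appended to the end of a question to ask for coordinates. The sketch below shows one way to build such a prompt and to pull boxes back out of the reply. It assumes the reply contains boxes written as four bracketed integers, as the "[x1, y1, x2, y2]" templates suggest; the actual output format is not documented here, and with_grounding, extract_boxes, and the sample reply are hypothetical.

```python
import re

# One suffix copied from the list above; any other entry is used the same way.
GROUNDING_SUFFIX = "\nAnswer in [x1, y1, x2, y2] format."

# Matches four comma-separated integers in square brackets, e.g. "[189, 906, 404, 970]".
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def with_grounding(prompt, suffix=GROUNDING_SUFFIX):
    """Append a grounding suffix to a plain question."""
    return prompt + suffix


def extract_boxes(text):
    """Return every [x1, y1, x2, y2] group found in text as a tuple of ints."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


# Hypothetical reply text, used only to illustrate the parsing step.
sample_reply = "The Get button [189, 906, 404, 970] sits below the title [40, 120, 600, 180]."
boxes = extract_boxes(sample_reply)
# boxes == [(189, 906, 404, 970), (40, 120, 600, 180)]
```

The string returned by with_grounding can be passed as the prompt argument of inference_and_run, and extract_boxes can then be applied to the returned text.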