Open-source Ferret-UI-Llama8b model: performs complex UI tasks such as referring, grounding, and reasoning

Ferret UI Llama8b

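Developed by jadechoghari

Ferret-UI is the first multimodal large language model (MLLM) focused on user interfaces. It is built on Llama-3-8B and can perform complex UI tasks such as referring, grounding, and reasoning.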

Image-to-Text

Transformers

#UI multimodal understanding #UI element grounding #screen content reasoning

Downloads: 256

Released: 10/9/2024

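Model Overview

Ferret-UI is a multimodal large language model designed specifically for user-interface tasks, including referring, grounding, and reasoning. It is based on the Llama-3-8B architecture and can interpret UI screenshots, returning detailed descriptions together with location information.

Model Features

Multimodal capability

Combines vision and language processing to understand and analyze UI images.

Optimized for UI tasks

Designed for UI-related referring, grounding, and reasoning tasks, so it can handle complex UI analysis.

Bounding-box grounding

Supports bounding-box localization, marking the position of individual UI elements.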

Model Capabilities

UI image analysis

Text generation

Bounding-box localization

Multimodal reasoning

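Use Cases

UI automation testing

UI element localization

Automatically identifies and locates specific UI elements, such as buttons and text fields.

Improves testing efficiency and accuracy.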
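
As a rough illustration of how a located element could feed a UI test, the sketch below turns a bounding box into a tap point. It assumes the box is given as [x1, y1, x2, y2] in the same coordinate space the test driver uses; `box_center` is a hypothetical helper and is not part of the Ferret-UI code.

```python
from typing import Sequence, Tuple


def box_center(box: Sequence[int]) -> Tuple[int, int]:
    """Return the integer center point of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2


# Example box taken from the usage section of this page.
tap_x, tap_y = box_center([189, 906, 404, 970])
print(tap_x, tap_y)
```

A test script could pass the resulting point to whatever tap or click call its automation framework provides.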

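Accessibility

UI description generation

Generates detailed descriptions of a UI for visually impaired users.

Improves the accessibility experience.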

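🚀 Ferret-UI

Ferret-UI is the first UI-centric multimodal large language model (MLLM), designed for referring, grounding, and reasoning tasks. Built on Gemma-2B and Llama-3-8B, it can perform complex UI tasks. This is the Llama-3-8B version of Ferret-UI; it follows this paper by Apple.

🚀 Quick Start

📦 Installation

First, download builder.py, conversation.py, inference.py, model_UI.py, and mm_utils.py to your local machine.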

wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/mm_utils.py
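
Where wget is not available, the same five files can be fetched from Python. The following is a minimal sketch using only the standard library; it reuses the URLs from the commands above unchanged, and the names `helper_urls` and `download_helpers` are illustrative, not part of the repository.

```python
from pathlib import Path
from typing import Dict, List
from urllib.request import urlretrieve

# Base URL copied from the wget commands above.
BASE_URL = "https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main"
HELPER_FILES = [
    "conversation.py",
    "builder.py",
    "inference.py",
    "model_UI.py",
    "mm_utils.py",
]


def helper_urls(base_url: str = BASE_URL) -> Dict[str, str]:
    """Map each helper file name to its download URL."""
    return {name: f"{base_url}/{name}" for name in HELPER_FILES}


def download_helpers(target_dir: str = ".") -> List[Path]:
    """Download any helper file that is not already present in target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, url in helper_urls().items():
        destination = target / name
        if not destination.exists():
            urlretrieve(url, str(destination))  # requires network access
        saved.append(destination)
    return saved
```

Calling `download_helpers()` from the directory that will hold your script places the files next to it, so that `from inference import inference_and_run` resolves.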

💻 Usage Examples

Basic Usage

from inference import inference_and_run
image_path = "appstore_reminders.png"
prompt = "Describe the image in details"

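# Call the function without a box.
# conv_mode and model_path are passed explicitly, as in the advanced example,
# rather than relying on the function's defaults.
inference_text = inference_and_run(
    image_path=image_path,
    prompt=prompt,
    conv_mode="ferret_llama_3",
    model_path="jadechoghari/Ferret-UI-Llama8b",
)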

print("Inference Text:", inference_text)

Advanced Usage

# Task with a bounding box
from inference import inference_and_run

image_path = "appstore_reminders.png"
prompt = "What's inside the selected region?"
box = [189, 906, 404, 970]  # region of interest, presumably [x1, y1, x2, y2]

inference_text = inference_and_run(
    image_path=image_path, 
    prompt=prompt, 
    conv_mode="ferret_llama_3", 
    model_path="jadechoghari/Ferret-UI-Llama8b", 
    box=box
)

print("Inference Text:", inference_text)
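
Before passing a region to the model, it can help to check that the box is well formed. The sketch below assumes the box is [x1, y1, x2, y2] in pixel coordinates of the input image, which the usage example suggests but does not state. `clamp_box` and the image size used here are hypothetical.

```python
from typing import Sequence, Tuple


def clamp_box(box: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Order the corners of a box and clamp them to the image bounds."""
    if len(box) != 4:
        raise ValueError("box must have exactly four values: [x1, y1, x2, y2]")
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    # Make sure (x1, y1) is the top-left corner and (x2, y2) the bottom-right.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    # Keep the box inside the image.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x1 >= x2 or y1 >= y2:
        raise ValueError("box has no area inside the image")
    return x1, y1, x2, y2


# Hypothetical screenshot size; replace with the real image dimensions.
box = list(clamp_box([189, 906, 404, 970], width=1170, height=2532))
```

The resulting list can then be passed as the `box` argument shown above.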

# GROUNDING PROMPTS
GROUNDING_TEMPLATES = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nInclude the coordinates for each mentioned object.',
    '\nLocate the objects with their coordinates.',
    '\nAnswer in [x1, y1, x2, y2] format.',
    '\nMention the objects and their locations using the format [x1, y1, x2, y2].',
    '\nDraw boxes around the mentioned objects.',
    '\nUse boxes to show where each thing is.',
    '\nTell me where the objects are with coordinates.',
    '\nList where each object is with boxes.',
    '\nShow me the regions with boxes.'
]
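
One plausible way to use these templates is to append one of them to a question and then pull any [x1, y1, x2, y2] groups out of the reply. The sketch below does this under two assumptions: that the templates are meant to be appended to the prompt, and that boxes appear in the reply as bracketed integer quadruples. `with_grounding` and `extract_boxes` are hypothetical helpers, and the sample reply is invented for illustration.

```python
import re
from typing import List, Tuple

# First entry of the grounding templates listed above.
GROUNDING_SUFFIX = "\nProvide the bounding boxes of the mentioned objects."

# Matches bracketed groups of four integers such as [189, 906, 404, 970].
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def with_grounding(prompt: str, suffix: str = GROUNDING_SUFFIX) -> str:
    """Append a grounding instruction to a plain prompt."""
    return prompt.rstrip() + suffix


def extract_boxes(text: str) -> List[Tuple[int, int, int, int]]:
    """Return every [x1, y1, x2, y2] group found in the text, in order."""
    return [tuple(int(g) for g in m.groups()) for m in BOX_PATTERN.finditer(text)]


prompt = with_grounding("Where is the search button?")
# Invented reply text, used only to exercise the parser.
sample_reply = "The search button [189, 906, 404, 970] is below the icon [10,20,30,40]."
print(extract_boxes(sample_reply))
```

The prompt built this way can be passed to `inference_and_run` in place of a plain question; the actual reply format should be checked against real model output before relying on the regular expression.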