---
license: mit
license_link: https://huggingface.co/microsoft/Florence-2-base-ft/resolve/main/LICENSE
pipeline_tag: image-to-text
tags:
- vision
---
# Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

## Model Summary

This Hub repository contains a Hugging Face `transformers` implementation of Microsoft's Florence-2 model.

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Given a simple text prompt, it can perform tasks such as captioning, object detection, and segmentation. It leverages the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images, to master multi-task learning. Its sequence-to-sequence architecture performs strongly in both zero-shot and fine-tuned settings, making it a highly competitive vision foundation model.
Technical resources:

| Model | Model size | Model description |
| ----- | ---------- | ------------------ |
| Florence-2-base [HF] | 0.23B | Pretrained model with FLD-5B |
| Florence-2-large [HF] | 0.77B | Pretrained model with FLD-5B |
| Florence-2-base-ft [HF] | 0.23B | Finetuned model on a collection of downstream tasks |
| Florence-2-large-ft [HF] | 0.77B | Finetuned model on a collection of downstream tasks |
## How to Get Started with the Model

Use the code below to get started with the model:
```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

prompt = "<OD>"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))

print(parsed_answer)
```
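If a CUDA-capable GPU is available, inference is typically faster when the model is loaded in half precision and the model and inputs are moved to the GPU. The following is a minimal sketch of that setup, not part of the original snippet; it reuses the `processor`, `prompt`, and `image` defined above, and the dtype and device choices are assumptions with a CPU fallback so it stays runnable anywhere:

```python
import torch

# Assumption: use GPU + float16 when available, otherwise fall back to CPU + float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", torch_dtype=dtype, trust_remote_code=True
).to(device)

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"].to(device),
    pixel_values=inputs["pixel_values"].to(device, dtype),  # cast only the image tensor
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)
```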
## Tasks

This model can perform a variety of tasks simply by changing the prompt. First, define a helper function that runs a prompt:
```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    print(parsed_answer)
```
Here are the tasks Florence-2 can perform:
### Caption

```python
prompt = "<CAPTION>"
run_example(prompt)
```
### Detailed Caption

```python
prompt = "<DETAILED_CAPTION>"
run_example(prompt)
```
### More Detailed Caption

```python
prompt = "<MORE_DETAILED_CAPTION>"
run_example(prompt)
```
### Caption to Phrase Grounding

This task requires an additional text input, i.e. the caption to ground. Results format:

```
{'<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}
```

```python
task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
run_example(task_prompt, text_input="A green car parked in front of a yellow building.")
```
### Object Detection

Results format:

```
{'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}
```

```python
prompt = "<OD>"
run_example(prompt)
```
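Since the parsed answer is a plain Python dict, the detections can be overlaid on the image directly with PIL. The sketch below is illustrative only; it assumes `detections` holds the `<OD>` output dict shown above (e.g. captured by having `run_example` return `parsed_answer`):

```python
from PIL import ImageDraw

def draw_detections(image, detections):
    # `detections` is assumed to look like:
    # {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['car', ...]}}
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for (x1, y1, x2, y2), label in zip(detections['<OD>']['bboxes'], detections['<OD>']['labels']):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(y1 - 12, 0)), label, fill="red")
    return annotated

# e.g. draw_detections(image, parsed_answer).save("od_result.png")
```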
### Dense Region Caption

Results format:

```
{'<DENSE_REGION_CAPTION>': {'bboxes': [...], 'labels': [...]}}
```

```python
prompt = "<DENSE_REGION_CAPTION>"
run_example(prompt)
```
### Region Proposal

Results format:

```
{'<REGION_PROPOSAL>': {'bboxes': [...], 'labels': [...]}}
```

```python
prompt = "<REGION_PROPOSAL>"
run_example(prompt)
```
### OCR

```python
prompt = "<OCR>"
run_example(prompt)
```
### OCR with Region

Results format:

```
{'<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, x2, y2, x3, y3, x4, y4], ...], 'labels': [...]}}
```

```python
prompt = "<OCR_WITH_REGION>"
run_example(prompt)
```
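Each entry in `quad_boxes` is a flat list of eight coordinates (four corner points), so text regions are polygons rather than axis-aligned boxes. A minimal, illustrative sketch for visualizing them, assuming `ocr_result` holds the `<OCR_WITH_REGION>` output dict shown above:

```python
from PIL import ImageDraw

def draw_ocr_regions(image, ocr_result):
    # `ocr_result` is assumed to look like:
    # {'<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, ..., x4, y4], ...], 'labels': [...]}}
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    regions = ocr_result['<OCR_WITH_REGION>']
    for quad, label in zip(regions['quad_boxes'], regions['labels']):
        points = list(zip(quad[0::2], quad[1::2]))  # [(x1, y1), ..., (x4, y4)]
        draw.polygon(points, outline="blue")
        draw.text(points[0], label, fill="blue")
    return annotated
```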
For more detailed examples, please refer to the notebook.
## Benchmarks

### Florence-2 Zero-shot Performance

The following table presents the zero-shot performance of generalist vision foundation models on image captioning and object detection evaluation tasks (the models were not exposed to the training data of these evaluation tasks):
| Method | #params | COCO Cap. test CIDEr | NoCaps val CIDEr | TextCaps val CIDEr | COCO Det. val mAP |
|--------|---------|----------------------|------------------|--------------------|-------------------|
| Flamingo | 80B | 84.3 | - | - | - |
| Florence-2-base | 0.23B | 133.0 | 118.7 | 70.1 | 34.7 |
| Florence-2-large | 0.77B | 135.6 | 120.8 | 72.8 | 37.5 |
The following table continues the comparison with performance on other vision-language evaluation tasks:
| Method | Flickr30k test R@1 | Refcoco val Accuracy | Refcoco test-A Accuracy | Refcoco test-B Accuracy | Refcoco+ val Accuracy | Refcoco+ test-A Accuracy | Refcoco+ test-B Accuracy | Refcocog val Accuracy | Refcocog test Accuracy | Refcoco RES val mIoU |
|--------|--------------------|----------------------|-------------------------|-------------------------|-----------------------|--------------------------|--------------------------|-----------------------|------------------------|----------------------|
| Kosmos-2 | 78.7 | 52.3 | 57.4 | 47.3 | 45.5 | 50.7 | 42.2 | 60.6 | 61.7 | - |
| Florence-2-base | 83.6 | 53.9 | 58.4 | 49.7 | 51.5 | 56.4 | 47.9 | 66.3 | 65.1 | 34.6 |
| Florence-2-large | 84.4 | 56.3 | 61.6 | 51.4 | 53.6 | 57.9 | 49.9 | 68.0 | 67.0 | 35.8 |
### Florence-2 Finetuned Performance

The table below compares specialist and generalist models on captioning and VQA tasks ("▲" indicates use of external OCR as input):
| Method | # Params | COCO Caption test CIDEr | NoCaps val CIDEr | TextCaps val CIDEr | VQAv2 test Acc | TextVQA test Acc | VizWiz VQA test Acc |
|--------|----------|-------------------------|------------------|--------------------|----------------|------------------|---------------------|
| Specialist Models | | | | | | | |
| CoCa | 2.1B | 143.6 | 122.4 | - | 82.3 | - | - |
| BLIP-2 | 7.8B | 144.5 | 121.6 | - | 82.2 | - | - |
| GIT2 | 5.1B | 145.0 | 126.9 | 148.6 | 81.7 | 67.3 | 71.0 |
| Flamingo | 80B | 138.1 | - | - | 82.0 | 54.1 | 65.7 |
| PaLI | 17B | 149.1 | 127.0 | 160.0▲ | 84.3 | 58.8 / 73.1▲ | 71.6 / 74.4▲ |
| PaLI-X | 55B | 149.2 | 126.3 | 147.0 / 163.7▲ | 86.0 | 71.4 / 80.8▲ | 70.9 / 74.6▲ |
| Generalist Models | | | | | | | |
| Unified-IO | 2.9B | - | 100.0 | - | 77.9 | - | 57.4 |
| Florence-2-base-ft | 0.23B | 140.0 | 116.7 | 143.9 | 79.7 | 63.6 | 63.6 |
| Florence-2-large-ft | 0.77B | 143.3 | 124.9 | 151.1 | 81.7 | 73.5 | 72.6 |
The table below compares performance on grounding and referring expression tasks:
| Method | # Params | COCO Det. val mAP | Flickr30k test R@1 | RefCOCO val Accuracy | RefCOCO test-A Accuracy | RefCOCO test-B Accuracy | RefCOCO+ val Accuracy | RefCOCO+ test-A Accuracy | RefCOCO+ test-B Accuracy | RefCOCOg val Accuracy | RefCOCOg test Accuracy | RefCOCO RES val mIoU |
|--------|----------|-------------------|--------------------|----------------------|-------------------------|-------------------------|-----------------------|--------------------------|--------------------------|-----------------------|------------------------|----------------------|
| Specialist Models | | | | | | | | | | | | |
| SeqTR | - | - | - | 83.7 | 86.5 | 81.2 | 71.5 | 76.3 | 64.9 | 74.9 | 74.2 | - |
| PolyFormer | - | - | - | 90.4 | 92.9 | 87.2 | 85.0 | 89.8 | 78.0 | 85.8 | 85.9 | 76.9 |
| UNINEXT | 0.74B | 60.6 | - | 92.6 | 94.3 | 91.5 | 85.2 | 89.6 | 79.8 | 88.7 | 89.4 | - |
| Ferret | 13B | - | - | 89.5 | 92.4 | 84.4 | 82.8 | 88.1 | 75.2 | 85.8 | 86.3 | - |
| Generalist Models | | | | | | | | | | | | |
| UniTAB | - | - | - | 88.6 | 91.1 | 83.8 | 81.0 | 85.4 | 71.6 | 84.6 | 84.7 | - |
| Florence-2-base-ft | 0.23B | 41.4 | 84.0 | 92.6 | 94.8 | 91.5 | 86.8 | 91.7 | 82.2 | 89.8 | 82.2 | 78.0 |
| Florence-2-large-ft | 0.77B | 43.4 | 85.2 | 93.4 | 95.3 | 92.0 | 88.3 | 92.9 | 83.6 | 91.2 | 91.7 | 80.5 |
## Citation

```
@article{xiao2023florence,
  title={Florence-2: Advancing a unified representation for a variety of vision tasks},
  author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
  journal={arXiv preprint arXiv:2311.06242},
  year={2023}
}
```