blip2-flan-t5-xxl open-source vision-language model - free deployment for image-to-text conversion

Blip2 Flan T5 Xxl

Developed by Salesforce

BLIP-2 is a vision-language model that combines an image encoder with the large language model Flan T5-xxl for image-to-text tasks.

Image-to-Text

Transformers

Language: English | License: MIT | #image-captioning #visual-question-answering #frozen-parameter-training

Downloads: 6,419

Released: 2/9/2023

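Model Overview

BLIP-2 keeps the image encoder and the large language model Flan T5-xxl frozen and trains a Querying Transformer (Q-Former) to bridge the embedding-space gap between images and text. It supports tasks such as image captioning and visual question answering.

Model Features

Frozen pre-trained models

The image encoder and the language model stay frozen; only the Querying Transformer is trained, which reduces training cost.

Multi-task support

Supports image captioning, visual question answering, and chat-like conversation.

Efficient embedding-space bridging

The Querying Transformer converts image embeddings into query embeddings that the language model can consume.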

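Model Capabilities

Image captioning

Visual question answering

Image-grounded dialogue

Use Cases

Image understanding

Image captioning

Generate a natural-language description of an input image.

Visual question answering

Answer natural-language questions about the content of an image.

Interactive applications

Image-grounded dialogue systems

Generate dialogue responses from image and text input.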

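🚀 BLIP-2, Flan T5-xxl, pre-trained only

The BLIP-2 model leverages Flan T5-xxl, a large language model. It was introduced by Li et al. in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models and first released in this repository.

Disclaimer: the team that released BLIP-2 did not write a model card for this model; this model card was written by the Hugging Face team.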

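🚀 Quick Start

You can use the raw model for conditional text generation given an image and optional text. Check the model hub for versions fine-tuned on a task that interests you.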

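✨ Key Features

BLIP-2 can be used for the following tasks:

Image captioning
Visual question answering (VQA)
Chat-like conversation, by feeding the image and the previous conversation to the model as the prompt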
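
The three tasks differ only in the text that accompanies the image. The sketch below shows one way to assemble that text. The helper name `build_prompt` and the "Question: ... Answer:" template are assumptions made for illustration; they are not prescribed by this model card.

```python
from typing import List, Optional, Tuple


def build_prompt(question: Optional[str] = None,
                 history: Optional[List[Tuple[str, str]]] = None) -> Optional[str]:
    """Build the text prompt that accompanies the image.

    - No question: return None, i.e. plain image captioning.
    - Question only: a single-turn VQA prompt.
    - Question plus history: earlier (question, answer) turns are prepended
      so the model sees the previous conversation as context.

    The "Question: ... Answer:" template is an assumption for illustration.
    """
    if question is None:
        return None
    turns = [f"Question: {q} Answer: {a}." for q, a in (history or [])]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


# Captioning, VQA, and a second chat turn, in that order.
print(build_prompt())
print(build_prompt("how many dogs are in the picture?"))
print(build_prompt("what is it doing?",
                   [("how many dogs are in the picture?", "one")]))
```

When the helper returns a string, it can be passed to the processor as the text argument next to the image; when it returns None, only the image is passed.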

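📚 Documentation

Model Description

BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model.

The authors initialize the weights of the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Querying Transformer. The Q-Former is a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings, which bridge the gap between the embedding space of the image encoder and that of the large language model.

The model's objective is simple: predict the next text token given the query embeddings and the previous text.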
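
To make the role of the query tokens concrete, here is a toy sketch of a fixed set of query vectors attending over a variable number of image features. It is a single-head cross-attention step in plain Python with no learned projections, and the sizes and values are hypothetical. The real Q-Former is a multi-layer, BERT-like Transformer, so treat this only as an illustration of why the number of output embeddings depends on the number of queries and not on the number of image features.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], image_feats: List[Vector]) -> List[Vector]:
    """Each query attends over all image features (single head, no projections).

    Returns one output vector per query, so the output length depends only on
    the number of queries, not on the number of image features.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every image feature.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in image_feats]
        weights = softmax(scores)
        # Weighted average of the image features.
        out = [sum(w * k[d] for w, k in zip(weights, image_feats))
               for d in range(dim)]
        outputs.append(out)
    return outputs


# Hypothetical toy sizes: 2 queries, 3 image patch features, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_embeddings = cross_attend(queries, image_feats)
print(len(query_embeddings), len(query_embeddings[0]))  # prints "2 2"
```

Adding more image features changes the values of the outputs but not their count, which is what lets a fixed-size set of query embeddings be handed to the language model.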

[Figure: model architecture]

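Direct Use and Downstream Use

You can use the raw model for conditional text generation given an image and optional text. See the model hub for versions fine-tuned on a task that interests you.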

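Bias, Risks, Limitations, and Ethical Considerations

BLIP2-FlanT5 uses off-the-shelf Flan-T5 as its language model and inherits the same risks and limitations from Flan-T5:

According to Rae et al. (2021), language models, including Flan-T5, can potentially be used for harmful text generation. Flan-T5 should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.

BLIP2 is fine-tuned on image-text datasets (e.g. LAION) collected from the internet. As a result, the model itself is potentially vulnerable to generating inappropriate content or replicating biases inherent in the underlying data.

BLIP2 has not been tested in real-world applications and should not be deployed directly in any application. Researchers should first carefully assess the safety and fairness of the model in the specific context in which it will be deployed.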

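Ethical Considerations

This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our Acceptable Use Policy and AI Acceptable Use Policy.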

💻 Usage Examples

Basic Usage

Running the model on CPU

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the model weights (on CPU by default)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Encode the image together with the question
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# Generate an answer and decode it to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Advanced Usage

Running the model on GPU

Full precision

# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
# device_map="auto" lets accelerate place the weights on the available devices
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Move the encoded inputs to the GPU before generation
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Half precision (`float16`)

# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
# Load the weights in float16 to roughly halve memory use compared with float32
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Move the inputs to the GPU and cast the floating-point tensors to float16
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

8-bit precision (`int8`)

# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import BitsAndBytesConfig, Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
# Quantize the weights to 8-bit with bitsandbytes while loading.
# Recent transformers releases prefer quantization_config over the bare load_in_8bit=True flag.
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Move the inputs to the GPU and cast the floating-point tensors to float16
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
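
The GPU variants differ mainly in how many bytes each weight occupies. As a rough guide to choosing between them, the sketch below estimates the memory needed for the weights alone. The parameter count is a hypothetical round number chosen for illustration, not a figure from this model card, and the estimate ignores activations, generation caches, and any layers that a quantization backend keeps in higher precision.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Hypothetical parameter count, for illustration only.
ASSUMED_NUM_PARAMS = 12_000_000_000

for dtype in ("float32", "float16", "int8"):
    gib = estimate_weight_memory_gib(ASSUMED_NUM_PARAMS, dtype)
    print(f"{dtype}: ~{gib:.1f} GiB")
```

Substitute the actual parameter count of the checkpoint you load to get a usable lower bound for the device memory required.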