FireLLaVA-13b开源视觉语言模型 - 免费部署实现图像理解与文本生成

首页

Firellava 13b

由 fireworks-ai 开发

FireLLaVA-13B是基于开源大语言模型生成指令数据训练的视觉语言模型，支持图像理解和文本生成任务。

图像生成文本

Transformers

#多模态视觉问答 #汽车品牌识别 #开源大模型微调

下载量 59

发布时间 : 1/5/2024

模型简介

这是一个结合视觉和语言能力的多模态模型，能够理解图像内容并生成相关文本回答。

模型特点

多模态理解

能够同时处理图像和文本输入，理解图像内容并生成相关回答

大语言模型基础

基于强大的LLaMA 2语言模型构建，具备优秀的文本生成能力

多图像支持

理论上支持单次提示传入多张图像（但训练时未专门优化）

模型能力

图像内容理解

视觉问答

多模态对话

图像描述生成

使用案例

图像理解

物体识别

识别图像中的物体并回答相关问题

示例中正确识别出大众汽车

场景描述

生成图像的详细文字描述

能够描述图像中的场景和物体关系

智能助手

视觉问答助手

回答用户关于图像内容的各类问题

🚀 FireLLaVA 13B模型

FireLLaVA 13B是一个基于OSS LLM生成的指令跟随数据训练的视觉语言模型，可用于图像相关的问答任务。用户可以在Fireworks.ai平台体验该模型，也能使用huggingface transformers库本地运行。

🚀 快速开始

使用此模型需遵循Meta许可证。若要下载模型权重和分词器，请访问网站，接受Llama 2社区许可协议后在此处申请访问。

模型部署在Fireworks.ai上，你可以在此处进行尝试：https://app.fireworks.ai/models/fireworks/firellava-13b 。API端点也已提供，相关说明链接如下：https://readme.fireworks.ai/docs/querying-vision-language-models 。

若你想使用huggingface transformers库在本地运行该模型，请阅读以下说明。首先，确保安装transformers >= 4.35.3。该模型支持多图像和多提示生成，即你可以在提示中传入多张图像。同时，请遵循正确的提示模板（USER: xxx\nASSISTANT:），并在需要查询图像的位置添加标记 <image>。不过要注意，由于模型未在多图像输入的情况下进行训练，输入多张图像时模型性能可能会下降。

✨ 主要特性

视觉语言融合：LLaVA视觉语言模型，基于OSS LLM生成的指令跟随数据进行训练。
多图像支持：支持在提示中传入多张图像进行多图像和多提示生成。
灵活使用：既可以在Fireworks.ai平台使用，也能使用huggingface transformers库本地运行。

📦 安装指南

若要在本地运行模型，需确保安装transformers >= 4.35.3。

💻 使用示例

基础用法

使用pipeline进行图像到文本的转换：

from transformers import pipeline
from PIL import Image    
import requests

model_id = "fireworks-ai/FireLLaVA-13b"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
>>> [{'generated_text': 'USER:  \nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT: Volkswagen'}]

高级用法

使用纯transformers库进行图像到文本的转换：

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "fireworks-ai/FireLLaVA-13b"

prompt = "USER: <image>\nWhat is this?\n\nASSISTANT:"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
>>> "This is an early Volkswagen Beetle car, also known as a VW bug, parked on a brick street and next to a building with doors ..."