PaliGemma-3b-ft-cococap-224: An Open-Source Vision-Language Model with Multilingual Support for Multiple Vision-Language Tasks

Paligemma 3b Ft Cococap 224

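Developed by Google

PaliGemma is a versatile, lightweight vision-language model (VLM) that accepts and produces text in multiple languages and is suited to a range of vision-language tasks.

Image-to-text

Transformers

#Multimodal vision-language #Lightweight VLM #Multilingual captioning

Downloads: 209

Released: 5/13/2024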

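Model Introduction

PaliGemma is built from open components, combining the SigLIP vision model with the Gemma language model. It handles image and short-video captioning, visual question answering, text reading, object detection, and segmentation.

Model Features

Versatility

Handles multiple vision-language tasks, such as question answering, captioning, and segmentation.

Multilingual support

Supports input and output in multiple languages.

Lightweight design

With a relatively small parameter count (about 3 billion), the model is practical for research and use across a range of devices.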

Model Capabilities

Image captioning

Visual question answering

Text reading

Object detection

Object segmentation

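Use Cases

Multimedia processing

Image captioning

Generates multilingual captions for images or short videos.

Produces captions that accurately describe the image content

Visual question answering

Answers natural-language questions about image content.

Provides accurate answers to the questions

Computer vision

Object detection

Detects objects in an image and outputs bounding-box coordinates.

Identifies and localizes objects in the image

Object segmentation

Performs pixel-level segmentation of objects in an image.

Generates object segmentation masks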

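🚀 PaliGemma Model Card

PaliGemma is a versatile, lightweight vision-language model (VLM) that takes an image and text as input and generates text as output, with support for multiple languages. It is suited to vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.

🚀 Quick Start

To use PaliGemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face before accepting; requests are processed immediately.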

📦 Installation

To run inference in 4-bit or 8-bit precision, install bitsandbytes along with accelerate:

pip install bitsandbytes accelerate

💻 Usage Examples

Basic usage

Running at the default precision (float32) on CPU:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

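# Note: this example loads the mix-224 checkpoint, not the ft-cococap-224 checkpoint named in the title.
model_id = "google/paligemma-3b-mix-224"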

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
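
The prompt in the example above is a task word followed by a language code (`caption es`). As a minimal sketch of building such prompts, the helper below assembles the string for a given code. The function name `build_caption_prompt` is hypothetical, and which language codes a given checkpoint actually responds to is not stated here, so treat any code other than `es` as an assumption to verify.

```python
def build_caption_prompt(lang="en"):
    """Build a captioning prompt of the form 'caption <lang>'.

    Hypothetical helper: it only formats the string and does not check
    whether the model supports the requested language.
    """
    code = lang.strip().lower()
    if not code:
        raise ValueError("language code must not be empty")
    return f"caption {code}"


# Reproduces the prompt used in the example above.
print(build_caption_prompt("es"))
```

The returned string can be passed as the `text` argument of the processor call shown in the example.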

Advanced usage

Running at other precisions on CUDA

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

Loading in 4-bit / 8-bit

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
from transformers import BitsAndBytesConfig

model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# 8-bit is shown here; BitsAndBytesConfig(load_in_4bit=True) selects 4-bit instead
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

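📚 Detailed Documentation

Model Information

Model Overview

PaliGemma is inspired by PaLI-3 and built from open components such as the SigLIP vision model and the Gemma language model. It consists of a Transformer decoder and a Vision Transformer image encoder, with 3 billion parameters in total.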
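
To put the parameter count in context, the sketch below estimates the memory needed just to hold the weights at the precisions used in the usage examples. It assumes roughly 3 billion parameters, as stated above, and standard storage widths (4 bytes for float32, 2 for bfloat16, 1 for 8-bit, 0.5 for 4-bit). It ignores activations, the generation cache, and quantization overhead, so real usage will be higher.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
PARAM_COUNT = 3_000_000_000  # approximate total from the model overview

# Assumed storage width per parameter for each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(param_count, bytes_per_param):
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    return param_count * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name:>8}: ~{weight_memory_gb(PARAM_COUNT, width):.1f} GB")
```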

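Input: an image and a text string, such as a captioning prompt or a question.
Output: text generated in response to the input, such as an image caption, an answer to a question, a list of object bounding-box coordinates, or segmentation codewords.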
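
Because bounding boxes come back as text, they have to be parsed before use. The sketch below shows one way to do that under an assumed output format that this page does not spell out: each box is four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, each an integer bin out of 1024, followed by a label, with boxes separated by `;`. The function name `parse_detections` and the example string are hypothetical; check the format against real model output before relying on it.

```python
import re

# Assumed format (not stated on this page): four <locNNNN> tokens per box in
# the order y_min, x_min, y_max, x_max, then a label; boxes separated by ";".
_BOX_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")
_LOC_PATTERN = re.compile(r"<loc(\d{4})>")


def parse_detections(text, image_width, image_height, bins=1024):
    """Convert assumed '<locNNNN>' detection text into pixel-space boxes."""
    detections = []
    for match in _BOX_PATTERN.finditer(text):
        # Normalize each bin index to the range [0, 1).
        y_min, x_min, y_max, x_max = (
            int(value) / bins for value in _LOC_PATTERN.findall(match.group(1))
        )
        detections.append(
            {
                "label": match.group(2).strip(),
                # Box as (x_min, y_min, x_max, y_max) in pixels.
                "box": (
                    x_min * image_width,
                    y_min * image_height,
                    x_max * image_width,
                    y_max * image_height,
                ),
            }
        )
    return detections


# Hypothetical output string used only for illustration.
example = "<loc0256><loc0128><loc0768><loc0896> car ; <loc0000><loc0000><loc0512><loc0512> tree"
for detection in parse_detections(example, image_width=1024, image_height=512):
    print(detection)
```

Text that contains no complete group of four location tokens yields an empty list, so the caller can tell "no detections" apart from a parsing error.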

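Model Data

Pre-training datasets: PaliGemma is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L/VQG-CC3M-35L, OpenImages, and WIT.
Data responsibility filtering: to train the model on clean data, several filters were applied to WebLI, including pornographic image filtering, text safety filtering, text toxicity filtering, and text personal information filtering.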