siglip-so400m-patch16-256-i18n: open-source multimodal model for zero-shot image classification and image-text retrieval

Siglip So400m Patch16 256 I18n

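Developed by Google

A multimodal model built on a SoViT backbone and improved with a sigmoid loss function. It supports zero-shot image classification and image-text retrieval.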

Transformers

License: Apache-2.0 #zero-shot-image-classification #multimodal-sigmoid-loss #multilingual-image-text-matching

Downloads: 230

Released: 10/21/2024

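Model overview

SigLIP is a vision-language pretraining model that improves on CLIP. It trains with a sigmoid loss, which allows larger batch sizes and also performs better at small batch sizes.

Model features

Sigmoid loss

Operates only on individual image-text pairs and needs no global normalization of pairwise similarities, so the batch size can be scaled up further.

Compute-optimal architecture

Uses SoViT-400m, a shape-optimized ViT whose dimensions were chosen for compute efficiency.

Multilingual support

Pretrained on a multilingual corpus at 256-pixel resolution, which makes it suitable for internationalized (i18n) applications.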

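Model capabilities

Zero-shot image classification

Image-text retrieval

Multimodal understanding

Use cases

Content classification

Animal recognition

Identify animals such as cats and dogs in images.

The example shows the model distinguishing cat images from dog images.

Media analysis

Scene understanding

Identify the type of activity in an image (for example, playing music or playing sports).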

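🚀 SigLIP (shape-optimized model)

SigLIP is a zero-shot image classification model pretrained on a multilingual corpus. It is suited to image classification and image-text retrieval. The model was proposed by Zhai et al.

🚀 Quick start

You can use the raw model for tasks such as zero-shot image classification and image-text retrieval. Other versions are available on the model hub.

✨ Key features

SigLIP is a multimodal model that builds on CLIP with a better loss function.
Its sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows the batch size to be scaled up further while also performing better at smaller batch sizes.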
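
To make the pairwise idea concrete, here is a minimal pure-Python sketch of a sigmoid loss over a small batch. It is an illustration, not the training code: the function names are hypothetical, the embeddings are toy vectors assumed to be L2-normalized, and the default temperature (10) and bias (-10) are taken from the initialization described in the SigLIP paper, where both are learnable.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum the sigmoid loss over every image-text pair, averaged over images.

    Embeddings are lists of L2-normalized vectors; pair (i, i) is the match.
    temperature and bias are assumed defaults (learnable in real training).
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 for the matching pair, -1 otherwise
            total += -log_sigmoid(label * logit)
    return total / n


# Toy batch: texts that line up with their images vs. texts in swapped order.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = pairwise_sigmoid_loss(aligned, aligned)
loss_swapped = pairwise_sigmoid_loss(aligned, swapped)
print(f"aligned batch: {loss_aligned:.4f}, swapped batch: {loss_swapped:.4f}")
```

Each pair contributes its own binary term, so no softmax over the whole batch is needed; that is the property the text above refers to as avoiding global normalization.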

💻 Usage examples

Basic usage

The following code uses the model for zero-shot image classification:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch16-256-i18n")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch16-256-i18n")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
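
The `torch.sigmoid` call above scores each candidate text independently, so the resulting probabilities need not sum to 1, unlike a softmax over the same logits. The sketch below contrasts the two on made-up logits; the values are hypothetical and are not outputs of this model.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


# Hypothetical logits for one image against three candidate texts.
logits = [-1.0, -6.0, -8.0]

sigmoid_probs = [sigmoid(x) for x in logits]
softmax_probs = softmax(logits)

print("sigmoid:", [round(p, 3) for p in sigmoid_probs])
print("softmax:", [round(p, 3) for p in softmax_probs])
```

With sigmoid scoring, every label can receive a low probability when none of the texts fits the image, whereas softmax always distributes a total of 1 across the candidates.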

Advanced usage

Alternatively, the pipeline API hides the preprocessing and scoring steps:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch16-256-i18n")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

See the documentation for more code examples.
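
For image-text retrieval, similarity scores can also be used to rank a gallery of images against a text query. The sketch below shows only the ranking step on toy vectors. In practice the embeddings would come from the model (in Transformers, `get_image_features` and `get_text_features` on the loaded model; that step is assumed here rather than shown). The function names, file names, and vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_emb, image_embs, top_k=2):
    """Return (image_id, score) pairs sorted by descending cosine similarity."""
    scored = [
        (image_id, cosine_similarity(text_emb, emb))
        for image_id, emb in image_embs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings standing in for real model outputs.
image_embs = {
    "cats.jpg": [0.9, 0.1, 0.0],
    "plane.jpg": [0.0, 0.2, 0.9],
    "dogs.jpg": [0.6, 0.7, 0.1],
}
query_emb = [1.0, 0.2, 0.0]  # hypothetical embedding of a text query

results = rank_images(query_emb, image_embs)
print(results)
```

For a large gallery, the image embeddings can be computed once and cached, so each new query only costs one text embedding plus the similarity pass.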

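📚 Detailed documentation

SigLIP uses the SoViT-400m architecture, the shape-optimized variant described in the paper "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design". The model was introduced by Zhai et al. in the paper "Sigmoid Loss for Language Image Pre-Training" and first released in this repository.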
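
As a small worked example, and assuming the model name follows the usual ViT convention in which `patch16` is the patch size and `256` is the input resolution in pixels, the number of image patches the vision encoder sees can be computed as follows. The helper name is hypothetical.

```python
def patch_grid(image_size: int, patch_size: int):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Values read off the model name (an assumption, see the note above).
per_side, num_patches = patch_grid(image_size=256, patch_size=16)
print(f"{per_side} x {per_side} grid, {num_patches} patches")
```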

🔧 Technical details

Training data

SigLIP is pretrained on the WebLI dataset (Chen et al., 2023).

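Preprocessing

Images are resized/rescaled to the same resolution (256x256) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
Text is tokenized and padded to the same length (64 tokens).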
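
The two preprocessing steps can be sketched in a few lines. This is an illustration of the arithmetic only, not the processor's implementation: the helper names, the padding id of 0, and the sample token ids are hypothetical, and rescaling 8-bit values by 1/255 before normalization is an assumption.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize with mean/std."""
    return (value / 255.0 - mean) / std


def pad_tokens(token_ids, max_length=64, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


# With mean 0.5 and std 0.5, the 8-bit range [0, 255] maps onto [-1, 1].
print([normalize_pixel(v) for v in (0, 255)])

padded = pad_tokens([101, 2023, 2003])  # made-up token ids
print(len(padded))
```

The fixed text length is why the basic usage example passes `padding="max_length"` to the processor.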

Compute

The model was trained on 16 TPU-v4 chips for three days.

📄 License

This model is released under the Apache-2.0 license.

BibTeX citation

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}