---
language: ja
license: apache-2.0
tags:
- clip
- japanese-clip
pipeline_tag: feature-extraction
---
# clip-japanese-base
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. The model was trained on approximately one billion web-collected image-text pairs and can be used for a range of vision tasks, including zero-shot image classification, text-to-image retrieval, and image-to-text retrieval.
## How to use
- Install the required packages.

```bash
pip install pillow requests sentencepiece transformers torch timm
```
- Run the following code.

```python
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer, image processor, and model (custom code from the repository).
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download an example image and prepare the image and text inputs.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)

# Compute image/text embeddings and zero-shot label probabilities.
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
```
## Model architecture
The model uses an Eva02-B Transformer as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.
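As a quick sanity check on this breakdown, the parameter counts of the loaded checkpoint can be inspected with standard PyTorch calls. This is a small sketch assuming `model` is the `AutoModel` instance loaded in the usage example above; the submodule names are defined by the repository's remote code, so they are discovered at runtime rather than assumed.

```python
# Rough parameter breakdown of the loaded checkpoint, per top-level submodule.
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")

total = sum(p.numel() for p in model.parameters())
print(f"total: {total / 1e6:.1f}M parameters")
```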
## Evaluation
### Dataset
The evaluation covers STAIR Captions (image-text retrieval, R@1) together with Recruit Datasets and ImageNet-1K (zero-shot image classification, acc@1), as reported in the table below.
### Result
| Model | Image encoder params | Text encoder params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
|---|---|---|---|---|---|
| Ours | 86M (Eva02-B) | 100M (BERT) | 0.30 | 0.89 | 0.58 |
| Stable-ja-clip | 307M (ViT-L) | 100M (BERT) | 0.24 | 0.77 | 0.68 |
| Rinna-ja-clip | 86M (ViT-B) | 100M (BERT) | 0.13 | 0.54 | 0.56 |
| Laion-clip | 632M (ViT-H) | 561M (XLM-RoBERTa) | 0.30 | 0.83 | 0.58 |
| Hakuhodo-ja-clip | 632M (ViT-H) | 100M (BERT) | 0.21 | 0.82 | 0.46 |
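For reference, acc@1 in the table is standard zero-shot top-1 accuracy: each image is assigned the label whose text embedding is most similar to its image embedding. The function below is a minimal sketch of that computation reusing `model`, `processor`, `tokenizer`, and `device` from the usage example; the `images`, `labels`, and `class_names` arguments are placeholders, not part of the released evaluation code.

```python
import torch

def zero_shot_top1(images, labels, class_names):
    """Zero-shot top-1 accuracy sketch.

    `images` is a list of PIL images, `labels` their ground-truth class indices,
    and `class_names` the candidate label strings -- all placeholders here.
    """
    text_inputs = tokenizer(class_names).to(device)
    correct = 0
    with torch.no_grad():
        # Encode the candidate labels once and normalize them.
        text_features = model.get_text_features(**text_inputs)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        for img, label in zip(images, labels):
            pixel_values = processor(img, return_tensors="pt").to(device)
            image_features = model.get_image_features(**pixel_values)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            # Predict the label whose embedding is most similar to the image.
            pred = int((image_features @ text_features.T).argmax(dim=-1))
            correct += int(pred == label)
    return correct / len(images)
```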
## License
Apache License, Version 2.0
## Citation
```bibtex
@misc{clip-japanese-base,
    title  = {CLIP Japanese Base},
    author = {Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama},
    url    = {https://huggingface.co/line-corporation/clip-japanese-base},
}
```