---
license: apache-2.0
inference: false
pipeline_tag: feature-extraction
tags:
- clip
- zh
- image-text
- feature-extraction
---
# Taiyi-CLIP-Roberta-102M-Chinese
## Brief Introduction
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a text encoder based on RoBERTa-base.
## Model Taxonomy
| Demand | Task | Series | Model | Parameters | Extra |
| :----: | :----: | :----: | :----: | :----: | :----: |
| Special domain | Multimodal | Taiyi | CLIP (RoBERTa) | 102M | Chinese |
## Model Information
We strictly follow CLIP's experimental setup to obtain powerful visual-language representations. To build the Chinese version of CLIP, we use the Chinese RoBERTa-wwm as the text encoder and CLIP's ViT-B-32 as the visual encoder. To make pre-training more efficient and stable, we freeze the visual encoder and only fine-tune the language encoder. The pre-training data combines the Wukong dataset (100 million pairs) and the Zero dataset from 360 (23 million pairs); training ran for 24 epochs on 32 A100 GPUs and took about 7 days. To the best of our knowledge, this model is the first open-source Chinese CLIP on Hugging Face.
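The freezing strategy described above can be reproduced with standard PyTorch parameter freezing. The following is a minimal sketch, not the released training code: the starting checkpoint `hfl/chinese-roberta-wwm-ext`, the 512-dimensional head (chosen to match ViT-B/32 image features), the learning rate, and the symmetric contrastive loss are all assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification, CLIPModel

# Text tower: Chinese RoBERTa-wwm with a 512-dim classification head acting as
# the projection into CLIP's embedding space (512 is assumed to match ViT-B/32).
text_encoder = BertForSequenceClassification.from_pretrained(
    "hfl/chinese-roberta-wwm-ext", num_labels=512)  # hypothetical starting checkpoint

# Visual tower: OpenAI CLIP ViT-B/32, kept frozen throughout pre-training.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
for p in clip_model.parameters():
    p.requires_grad = False
clip_model.eval()

# Only the language encoder receives gradient updates.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)

def training_step(pixel_values, input_ids):
    """One contrastive (InfoNCE-style) step over a batch of matched image-text pairs."""
    with torch.no_grad():
        img = clip_model.get_image_features(pixel_values=pixel_values)
    txt = text_encoder(input_ids).logits
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = clip_model.logit_scale.exp() * img @ txt.t()
    labels = torch.arange(len(logits), device=logits.device)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```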
## Downstream Task Performance
### Zero-Shot Image Classification
| Model | Dataset | Top-1 Accuracy | Top-5 Accuracy |
| ---- | ---- | ---- | ---- |
| Taiyi-CLIP-Roberta-102M-Chinese | ImageNet1k (Chinese) | 42.85% | 71.48% |
### Zero-Shot Text-to-Image Retrieval
| Model | Dataset | Top-1 | Top-5 | Top-10 |
| ---- | ---- | ---- | ---- | ---- |
| Taiyi-CLIP-Roberta-102M-Chinese | Flickr30k Chinese test set | 46.32% | 74.58% | 83.44% |
| Taiyi-CLIP-Roberta-102M-Chinese | COCO Chinese test set | 47.10% | 78.53% | 87.84% |
| Taiyi-CLIP-Roberta-102M-Chinese | Wukong 50k dataset | 49.18% | 81.94% | 90.27% |
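For text-to-image retrieval like the benchmarks above, one typically pre-computes image features for a gallery and ranks them by cosine similarity to the encoded text query. A minimal sketch, reusing the two checkpoints from the usage section below; the local image file names are hypothetical.

```python
import torch
from PIL import Image
from transformers import BertTokenizer, BertForSequenceClassification, CLIPProcessor, CLIPModel

# Taiyi text encoder and frozen CLIP ViT-B/32 image encoder.
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

gallery_paths = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]  # hypothetical gallery

with torch.no_grad():
    # Encode and L2-normalize every gallery image once (can be cached offline).
    pixels = processor(images=[Image.open(p) for p in gallery_paths], return_tensors="pt")
    image_feats = clip_model.get_image_features(**pixels)
    image_feats = image_feats / image_feats.norm(dim=1, keepdim=True)

    # Encode the Chinese text query the same way as in the usage example.
    query = tokenizer(["两只猫"], return_tensors="pt", padding=True)["input_ids"]
    text_feat = text_encoder(query).logits
    text_feat = text_feat / text_feat.norm(dim=1, keepdim=True)

    # Rank gallery images by cosine similarity to the query.
    scores = (text_feat @ image_feats.t()).squeeze(0)
    ranking = scores.argsort(descending=True)

for rank, idx in enumerate(ranking.tolist(), 1):
    print(f"rank {rank}: {gallery_paths[idx]} (score {scores[idx].item():.3f})")
```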
## Usage
```python
from PIL import Image
import requests
import numpy as np
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import CLIPProcessor, CLIPModel

# Candidate Chinese captions: "a cat", "a dog", "two cats", "two tigers", "a tiger"
query_texts = ["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"]

# Taiyi text encoder (Chinese RoBERTa); its .logits output is used as the text feature
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

# Frozen OpenAI CLIP ViT-B/32 image encoder and its preprocessor
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

with torch.no_grad():
    image_features = clip_model.get_image_features(**image)
    text_features = text_encoder(text).logits
    # L2-normalize both feature sets, then score each caption against the image
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    logit_scale = clip_model.logit_scale.exp()  # learned temperature
    logits_per_image = logit_scale * image_features @ text_features.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(np.around(probs, 3))  # probability of each caption matching the image
```
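A note on the design: the text encoder is loaded as `BertForSequenceClassification`, so its classification head presumably serves as the projection from RoBERTa's pooled representation into CLIP's shared embedding space, which is why `.logits` is compared directly against `get_image_features` output. With the variables from the example above still in scope, the shapes can be checked as follows (the 512 dimension is an assumption based on ViT-B/32's feature size):

```python
# Both feature matrices should share the same last dimension so the dot
# product above is well defined; for ViT-B/32 this is expected to be 512.
print(image_features.shape)  # e.g. torch.Size([1, 512]) - one embedding for the image
print(text_features.shape)   # e.g. torch.Size([5, 512]) - one embedding per query text
```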
## Citation
If you use this model in your research, please cite our paper:
```bibtex
@article{fengshenbang,
  author  = {Jiaxing Zhang and others},
  title   = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal = {CoRR},
  volume  = {abs/2209.02970},
  year    = {2022}
}
```
You can also cite our project website:
```bibtex
@misc{Fengshenbang-LM,
  title        = {Fengshenbang-LM},
  author       = {IDEA-CCNL},
  year         = {2021},
  howpublished = {\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```