language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- clip
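FG-CLIP: A Fine-Grained Visual and Textual Alignment Model
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal contribution, †Corresponding author)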



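Model Framework
FG-CLIP is trained in two stages. The first stage uses global-level image-caption pairs to achieve an initial fine-grained alignment. The second stage refines that alignment with additional region-level captions, including detailed region annotations and positive/negative region descriptions.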
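To make "global-level alignment" concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss over a batch of image-caption similarities. It is an assumption about what such an objective could look like, not the training code of FG-CLIP; the function names, the `scale` parameter, and the toy similarity values are all hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_norm for v in row]

def symmetric_contrastive_loss(sim, scale=1.0):
    """Average of image-to-text and text-to-image cross-entropy.

    sim[i][j] is the similarity between image i and caption j;
    matching pairs are assumed to sit on the diagonal.
    """
    n = len(sim)
    logits = [[scale * v for v in row] for row in sim]
    # Image -> text: each row should peak at its own caption.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: each column should peak at its own image.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch of two pairs (illustrative values, not model output).
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.1, 0.9], [0.8, 0.2]]
loss_aligned = symmetric_contrastive_loss(aligned, scale=10.0)
loss_shuffled = symmetric_contrastive_loss(shuffled, scale=10.0)
print(loss_aligned < loss_shuffled)
```

The loss is lower when each image is most similar to its own caption, which is the behaviour a global alignment stage rewards. Region-level supervision would presumably apply a similar comparison between region features and region captions, but that detail is not specified in this card.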
Quick Start 🤗
Load Model
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

model_root = "qihoo360/fg-clip-large"
image_size = 336  # side length the input image is resized to below

# trust_remote_code=True loads the custom model code published with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
Retrieval
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog"]
# Tokenize the candidate captions, padded or truncated to 77 tokens.
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
    # L2-normalize so that the dot product below is a cosine similarity.
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

# Scale the similarities by the model's logit scale, then softmax over the captions.
logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
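The last three lines of the retrieval snippet turn cosine similarities into a probability per caption. The sketch below reproduces that arithmetic in plain Python so that the effect of the logit scale is easy to see. The similarity values and the scale of 100.0 are assumptions chosen for illustration, standing in for real model outputs and for `model.logit_scale.exp()`; `caption_probabilities` is a hypothetical helper.

```python
import math

def caption_probabilities(similarities, logit_scale):
    """Softmax over scaled cosine similarities for one image."""
    logits = [logit_scale * s for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

captions = ["a photo of a cat", "a photo of a dog"]
# Hypothetical cosine similarities; real values come from the model.
cosine = [0.28, 0.21]
# Assumed scale for illustration only.
probs = caption_probabilities(cosine, logit_scale=100.0)

# Pick the caption with the highest probability.
best = max(range(len(captions)), key=lambda i: probs[i])
print(captions[best], round(probs[best], 3))
```

A small gap in cosine similarity becomes a large gap in probability once it is multiplied by the scale, which is why the printed probabilities are often close to 0 or 1.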
Dense Feature Visualization
import math
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the figure is only written to disk
import matplotlib.pyplot as plt

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))
image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    # One feature vector per image patch.
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

# Cosine similarity between every patch and the caption.
similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()

# Number of patches along each side of the (square) patch grid.
patch_size = int(math.sqrt(similarity.shape[0]))
original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape)

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity heatmap')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
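The heatmap is a grid of per-patch scores, so each cell corresponds to a square region of the resized image. As a rough sketch of how to go from the flat similarity vector to a location in pixels, the hypothetical helper below finds the highest-scoring patch and returns its bounding box. It assumes the scores are in row-major order over a square grid, which matches how the snippet above reshapes them; the toy 3x3 scores are illustrative only.

```python
import math

def top_patch_box(similarity, image_size):
    """Return (row, col, box) of the highest-scoring patch.

    similarity is a flat list with one score per patch, assumed to be
    in row-major order over a square grid. box is (left, top, right,
    bottom) in pixels of the resized image.
    """
    grid = math.isqrt(len(similarity))
    if grid * grid != len(similarity):
        raise ValueError("expected a square number of patches")
    cell = image_size / grid  # pixels covered by one patch side
    best = max(range(len(similarity)), key=lambda i: similarity[i])
    row, col = divmod(best, grid)
    box = (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)
    return row, col, box

# Toy 3x3 map with the peak in the centre (illustrative values).
scores = [0.1, 0.2, 0.1,
          0.2, 0.9, 0.3,
          0.1, 0.2, 0.1]
row, col, box = top_patch_box(scores, image_size=336)
print(row, col, box)
```

With real model output, `similarity.tolist()` could be passed in place of `scores`, and the returned box could be drawn on the resized image to check that the caption highlights the expected region.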
Citation
If you find FG-CLIP useful for your research or applications, please cite it using this BibTeX entry:
@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071},
}
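Code: https://github.com/360CVGroup/FG-CLIP
License
The datasets and checkpoints used in this project are subject to their respective original licenses, and users must comply with all terms and conditions of those licenses.
The content of this project itself is licensed under the Apache License 2.0.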