License: Apache-2.0
This paper presents LLM2CLIP, a novel approach that harnesses large language models (LLMs) to unlock CLIP's potential. By fine-tuning the LLM in the caption space with a contrastive learning objective, we distill its textual capabilities into the output embeddings, significantly improving the textual discriminability of the output layer. We then design an efficient training procedure in which the fine-tuned LLM acts as a powerful teacher for CLIP's vision encoder. Thanks to the LLM, we can use longer and more complex captions, no longer constrained by the context window and capability limits of the original CLIP text encoder. Experiments show that this approach delivers substantial gains on cross-modal tasks: it improves the previous SOTA model EVA02 by 16.5% on both long-text and short-text retrieval, and turns a CLIP model trained only on English data into a state-of-the-art cross-lingual model. Moreover, when integrated with multimodal models such as Llava 1.5, it consistently outperforms CLIP on nearly all benchmarks, demonstrating comprehensive performance improvements.
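The alignment stage can be pictured as a standard CLIP-style symmetric contrastive objective, with the text side supplied by the frozen, caption-finetuned LLM. The sketch below is illustrative only; it is not the authors' training code, and the function and tensor names are assumptions made for exposition:

```python
# Illustrative sketch: a CLIP-style symmetric contrastive loss aligning image
# features from the trainable vision encoder with text features produced by a
# frozen, caption-finetuned LLM. Names, shapes, and the temperature handling
# are assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_feats: torch.Tensor,
                               text_feats: torch.Tensor,
                               logit_scale: float = 100.0) -> torch.Tensor:
    # Project both modalities onto the unit sphere.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Pairwise cosine similarities, scaled by a temperature term.
    logits = logit_scale * image_feats @ text_feats.T
    targets = torch.arange(logits.size(0), device=logits.device)
    # Image i and caption i form the positive pair; all others are negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```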
## LLM2CLIP Performance
**Please note: all results reported in the paper were evaluated with the PyTorch weights. Performance may differ when using the Hugging Face models.**
## Model Details
- Model type: vision foundation model, feature backbone
- Pretraining datasets: CC3M, CC12M, YFCC15M, and Recap-DataComp-1B (30M subset)
## Usage
### Huggingface Version
**Image Embeddings**
```python
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
import torch

image_path = "CLIP.png"
model_name_or_path = "microsoft/LLM2CLIP-Openai-B-16"

# Standard OpenAI CLIP ViT-B/16 image preprocessing.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=True).to('cuda').eval()

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    # Image embeddings from the LLM2CLIP vision encoder.
    outputs = model.get_image_features(input_pixels)
```
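As a small follow-up (not part of the official example; the second file name is a placeholder), the resulting embeddings can be L2-normalized and compared directly, for example to measure how similar two images are:

```python
# Follow-up sketch: cosine similarity between two image embeddings.
# Reuses `model` and `processor` from above; "other.png" is a placeholder path.
images = [Image.open("CLIP.png"), Image.open("other.png")]
pixels = processor(images=images, return_tensors="pt").pixel_values.to('cuda')
with torch.no_grad(), torch.cuda.amp.autocast():
    feats = model.get_image_features(pixels)
feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize each embedding
print("cosine similarity:", (feats[0] @ feats[1]).item())
```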
**Retrieval**
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from PIL import Image
import torch
from llm2vec import LLM2Vec
from transformers import AutoModel, AutoConfig, AutoTokenizer, CLIPImageProcessor

# CLIP-style vision tower with the LLM2CLIP projection head.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
model_name_or_path = "microsoft/LLM2CLIP-Openai-B-16"
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).to('cuda').eval()

# Caption-finetuned Llama-3 used as the text encoder, wrapped with LLM2Vec.
llm_model_name = 'microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned'
config = AutoConfig.from_pretrained(llm_model_name, trust_remote_code=True)
llm_model = AutoModel.from_pretrained(
    llm_model_name,
    torch_dtype=torch.bfloat16,
    config=config,
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
# Workaround so llm2vec treats the finetuned weights as a standard Llama-3-8B-Instruct model.
llm_model.config._name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
l2v = LLM2Vec(llm_model, tokenizer, pooling_mode="mean", max_length=512, doc_max_length=512)

captions = ["a diagram", "a dog", "a cat"]
image_path = "CLIP.png"

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.get_image_features(input_pixels)
    text_features = model.get_text_features(text_features)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
## BibTeX & Citation
```
@misc{huang2024llm2clippowerfullanguagemodel,
      title={LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation},
      author={Weiquan Huang and Aoqi Wu and Yifan Yang and Xufang Luo and Yuqing Yang and Liang Hu and Qi Dai and Xiyang Dai and Dongdong Chen and Chong Luo and Lili Qiu},
      year={2024},
      eprint={2411.04997},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.04997},
}
```