license: mit
pipeline_tag: image-to-text
BLIP-Large模型的Mocha微调版本
这是BLIP-Large模型的官方微调版本,采用MOCHa强化学习框架在MS-COCO数据集上进行微调,相关技术细节请参阅论文《缓解开放词汇描述幻觉问题》Mitigating Open-Vocabulary Caption Hallucinations
项目主页
使用说明
本模型支持条件式与非条件式图像描述生成
使用PyTorch模型
在CPU上运行
点击展开
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
text = "一张"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
在GPU上运行
全精度模式
点击展开
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha").to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
text = "一张"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
inputs = processor(raw_image, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
半精度模式(float16
)
点击展开
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha", torch_dtype=torch.float16).to("cuda")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
text = "一张"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
>>> 日落时分的海滩上有一位女士和一只狗
参考文献:
@misc{benkish2024mitigating,
title={缓解开放词汇描述幻觉问题},
author={阿萨夫·本-基什 和 莫兰·亚努卡 和 莫里斯·阿尔珀 和 拉贾·吉尔耶斯 和 哈达尔·阿弗布赫-埃洛尔},
year={2024},
eprint={2312.03631},
archivePrefix={arXiv},
primaryClass={cs.CV}
}