best_model_ViTB16_GPT2: an open-source cross-modal model that generates natural-language image descriptions for free


Best Model ViTB16 GPT2

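Developed by evlinzxxx

A cross-modal model based on the Vision Transformer (ViT) and GPT-2 that generates natural-language descriptions of input images

Image-to-Text

Transformers

Supports multiple languages #Multilingual image captioning #Vision Transformer architecture #Bilingual evaluation optimization

Downloads: 15

Released: 5/19/2024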

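Model Overview

The model combines a ViT-B/16 vision encoder with a GPT-2 text decoder. It is built for image-to-text generation and produces image descriptions in English and Indonesian.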
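To make the "/16" in ViT-B/16 concrete, the short sketch below counts how many tokens the vision encoder would hand to the GPT-2 decoder for one image. It assumes a 224×224 input and one extra [CLS] token, which are the usual ViT-B/16 defaults; the model card does not state the resolution this checkpoint uses, and the function name is illustrative.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16,
                        add_cls_token: bool = True) -> int:
    """Return the number of tokens a ViT encoder produces for a square image.

    image_size=224 and the [CLS] token are assumptions (common ViT-B/16 defaults),
    not values taken from this model card.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    return num_patches + (1 if add_cls_token else 0)


print(vit_sequence_length())  # 196 patch tokens + 1 [CLS] token = 197
```

Under these assumptions, each image becomes a sequence of 197 encoder states that the text decoder can attend to while it generates the caption.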

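Model Features

Cross-modal understanding

Converts visual information into natural-language descriptions (image-to-text)

Multilingual support

Generates image descriptions in English and Indonesian

Pretrained architecture

Built on a ViT-B/16 vision encoder and a GPT-2 text decoder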

Model Capabilities

Image understanding

Multilingual text generation

Vision-language alignment

Scene description

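Use Cases

Assistive technology

Assistance for visually impaired users

Generates spoken descriptions of image content for visually impaired users

Helps visually impaired users understand visual content

Content management

Automatic image tagging

Automatically generates descriptive tags for image libraries

Improves image retrieval efficiency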
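One way to picture the automatic tagging use case is to reduce each generated caption to keyword tags and keep an inverted index from tag to image. The sketch below is only an illustration under simple assumptions: whitespace tokenization, a tiny hand-written stopword list, and a hypothetical second caption (`example_2.jpg`). The first caption is the sample output shown in this model card.

```python
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not part of the model).
STOPWORDS = {"a", "an", "the", "to", "of", "your", "on", "in", "and", "is", "around"}


def caption_to_tags(caption: str) -> list[str]:
    """Lowercase the caption, strip punctuation, drop stopwords and duplicates."""
    seen, tags = set(), []
    for raw in caption.split():
        word = raw.strip(".,!?").lower()
        if word and word not in STOPWORDS and word not in seen:
            seen.add(word)
            tags.append(word)
    return tags


def build_index(captions: dict[str, str]) -> dict[str, set[str]]:
    """Map each tag to the set of image IDs whose caption contains it."""
    index = defaultdict(set)
    for image_id, caption in captions.items():
        for tag in caption_to_tags(caption):
            index[tag].add(image_id)
    return index


captions = {
    "gl_16.jpg": "navigate around the obstacle ahead adjusting your route to bypass the parked car.",
    "example_2.jpg": "a car is parked on the street.",  # hypothetical caption
}
index = build_index(captions)
print(sorted(index["car"]))
```

Looking up a tag is then a dictionary access instead of a scan over every caption, which is the sense in which tags can speed up retrieval.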

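🚀 Image Captioning Model

This project is an image captioning model that converts images to text. It supports Indonesian and English and is evaluated with metrics such as BLEU and ROUGE.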
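For readers unfamiliar with BLEU, the sketch below shows the idea on a single caption: clipped n-gram precision combined with a brevity penalty. It is a simplified, sentence-level version that assumes whitespace tokenization, one reference, n-grams up to length 2, and no smoothing. It is not the evaluation script used for this model, and the example captions are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


score = sentence_bleu("a parked car ahead", "a parked car is ahead")
print(f"BLEU-2: {score:.3f}")
```

A caption identical to its reference scores 1.0, and a caption that shares no words with it scores 0.0.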

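🚀 Quick Start

Installing dependencies

Make sure the required libraries are installed. You can install them with:

pip install transformers torch pillow

Running the code

The following minimal example shows how to generate an image caption with the pretrained model: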

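from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2Tokenizer
import torch
from PIL import Image
from IPython.display import display  # available in Jupyter/Colab notebooks

# Load the pretrained model, image processor, and tokenizer
model = VisionEncoderDecoderModel.from_pretrained("evlinzxxx/best_model_ViTB16_GPT2")
image_processor = ViTImageProcessor.from_pretrained("evlinzxxx/best_model_ViTB16_GPT2")
tokenizer = GPT2Tokenizer.from_pretrained("evlinzxxx/best_model_ViTB16_GPT2")

# Select the device and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# The original snippet called load_image and get_caption without defining them.
# The two helpers below are a minimal reconstruction, not the author's code.
def load_image(path):
  # open a local image file and convert it to RGB
  return Image.open(path).convert("RGB")

def get_caption(model, image_processor, tokenizer, path):
  # preprocess the image into pixel values on the model's device
  pixel_values = image_processor(images=load_image(path), return_tensors="pt").pixel_values.to(device)
  # generate token IDs with the generation settings saved with the model
  with torch.no_grad():
    output_ids = model.generate(pixel_values)
  # decode the token IDs into a list of caption strings
  return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

def show_image_and_captions(path):
  # get the image and display it
  display(load_image(path))
  # get the caption from the model
  our_caption = get_caption(model, image_processor, tokenizer, path)
  # print the caption
  print(f"Our caption: {our_caption}")

# Generate a caption for an image
show_image_and_captions("/content/drive/MyDrive/try/test_400/gl_16.jpg") # ['navigate around the obstacle ahead adjusting your route to bypass the parked car.']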

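✨ Key Features

Multilingual support: Indonesian (id) and English (en).
Evaluation metrics: model performance is evaluated with metrics such as BLEU and ROUGE.
Model architecture: a Vision Transformer (ViT-B/16) encoder paired with a GPT-2 decoder.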
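ROUGE has several variants; the sketch below illustrates ROUGE-L, which scores a caption by the longest common subsequence (LCS) it shares with a reference. It assumes whitespace tokenization, a single reference, and the plain F1 combination of precision and recall. It is an illustration of the metric, not the evaluation code behind this model, and the example captions are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate covered by the LCS
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("a parked car ahead", "a parked car is ahead")
print(f"ROUGE-L F1: {score:.3f}")
```

Because the LCS only requires words to appear in the same order, not contiguously, ROUGE-L still rewards a caption that skips a word of the reference.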
