ReT-OpenCLIP-ViT-G-14 open-source model: multimodal queries and fine-grained document retrieval

Ret OpenCLIP ViT G 14

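Developed by aimagelab

ReT is a method for multimodal document retrieval that supports multimodal queries and documents. It achieves fine-grained retrieval by combining representations from different layers of the visual and textual backbones.

Multimodal fusion

Transformers

License: Apache-2.0 #multimodal-document-retrieval #recurrent-gated-Transformer #cross-layer-feature-fusion

Downloads: 77

Released: 3/25/2025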

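Model overview

ReT uses a Transformer-based recurrent cell with a sigmoid gating mechanism. It accepts mixed image and text input and targets visual document retrieval.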

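Model highlights

Multi-level feature integration

Unlike conventional methods that use only last-layer features, ReT combines representations from different layers of the visual and textual backbones.

Sigmoid gating mechanism

LSTM-inspired gates selectively regulate the flow of information across layers and across modalities.

Mixed-modality processing

Queries and documents are processed independently, and each can be image-only, text-only, or a mix of both.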

Model capabilities

Multimodal document retrieval

Joint image-text feature extraction

Fine-grained similarity computation

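Use cases

Information retrieval

Document retrieval for visual question answering

Given a question and a reference image, retrieve the relevant documents that contain the answer.

Validated on a customized version of the M2KR benchmark.

Cross-modal retrieval

Retrieve relevant image documents with a text query, or relevant text documents with an image query.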

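🚀 ReT - Multimodal Document Retrieval Model

ReT is a method for multimodal document retrieval that supports both multimodal queries and multimodal documents. Unlike existing methods that use only the features from the last layer of a vision-and-language backbone, ReT uses a Transformer-based recurrent cell to exploit multi-level representations from different layers of the visual and textual backbones. The model features sigmoidal gates, inspired by the design of LSTMs, that selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently and produces sets of latent tokens that are used for fine-grained late-interaction similarity computation. ReT is designed to handle images and text in both queries and documents. To this end, it was trained and evaluated on a custom version of the challenging M2KR benchmark with the following modifications: MSMARCO was excluded because it contains no images, and images were added to the documents from OVEN, InfoSeek, E-VQA, and OKVQA.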
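The gated recurrence described above can be illustrated with a toy PyTorch module. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class name `GatedLayerFusion`, the helper `fuse_layers`, the use of a single cross-attention step, and the exact gate layout are all hypothetical, and the per-layer features are assumed to be already projected to a common width and concatenated across modalities. For the actual architecture, refer to the paper and the repository.

```python
import torch
from torch import nn


class GatedLayerFusion(nn.Module):
    """Hypothetical recurrent cell that carries latent tokens across backbone layers."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Cross-attention: latent tokens (queries) attend to one layer's features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Sigmoid gates in the spirit of an LSTM forget gate and input gate.
        self.forget_gate = nn.Linear(dim, dim)
        self.input_gate = nn.Linear(dim, dim)

    def forward(self, state: torch.Tensor, layer_feats: torch.Tensor) -> torch.Tensor:
        # state: [B, K, D] latent tokens; layer_feats: [B, T, D] features of one layer.
        candidate, _ = self.attn(state, layer_feats, layer_feats)
        f = torch.sigmoid(self.forget_gate(state))      # how much old state to keep
        i = torch.sigmoid(self.input_gate(candidate))   # how much new content to admit
        return f * state + i * candidate


def fuse_layers(cell: GatedLayerFusion, init_state: torch.Tensor, per_layer_feats):
    """Apply the same cell once per backbone layer, from shallow to deep."""
    state = init_state
    for feats in per_layer_feats:
        state = cell(state, feats)
    return state


# Demo with random tensors standing in for backbone activations (assumption).
torch.manual_seed(0)
cell = GatedLayerFusion(dim=128)
init_state = torch.zeros(2, 32, 128)                     # 32 latent tokens per sample
layers = [torch.randn(2, 50, 128) for _ in range(3)]     # 3 layers, 50 tokens each
out = fuse_layers(cell, init_state, layers)
print(out.shape)  # torch.Size([2, 32, 128])
```

Reusing one cell across layers is what makes the fusion recurrent: the latent tokens play the role of the hidden state, and the gates decide per feature how much of each layer's content is kept.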

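🚀 Quick Start

ReT is a method for multimodal document retrieval that supports multimodal queries and documents. It builds on the Transformer architecture and extracts multi-level features from different layers of the visual and textual backbones.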

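✨ Key Features

Multimodal support: handles multimodal queries and documents that contain images and text.
Multi-level features: a Transformer-based recurrent cell exploits representations from different layers of the visual and textual backbones.
Sigmoidal gates: LSTM-inspired gates selectively control the flow of information between layers and modalities.
Fine-grained interaction: queries and documents are processed independently into sets of latent tokens used for fine-grained late-interaction similarity computation.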

📦 Installation

Follow the instructions in the repository to set up the required environment.

💻 Usage Examples

Basic usage

from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-OpenCLIP-ViT-G-14', device_map=device)

# QUERY
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])


# PASSAGE
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
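
The example above yields one `[1, 32, 128]` token set per query and per passage. A late-interaction score between such token sets could be computed along the following lines. This is a sketch that assumes a ColBERT-style MaxSim rule over L2-normalized tokens; the function name `late_interaction_scores` is hypothetical, the exact scoring used by ReT should be taken from the repository, and random tensors stand in for real `get_ret_features` outputs.

```python
import torch
import torch.nn.functional as F


def late_interaction_scores(q_feats: torch.Tensor, p_feats: torch.Tensor) -> torch.Tensor:
    """MaxSim-style scores between query and passage token sets (assumed rule).

    q_feats: [Nq, Kq, D] query tokens; p_feats: [Np, Kp, D] passage tokens.
    Returns a [Nq, Np] score matrix.
    """
    q = F.normalize(q_feats, dim=-1)
    p = F.normalize(p_feats, dim=-1)
    # Cosine similarity between every query token and every passage token.
    sim = torch.einsum("qid,pjd->qpij", q, p)            # [Nq, Np, Kq, Kp]
    # For each query token keep its best-matching passage token, then sum.
    return sim.max(dim=-1).values.sum(dim=-1)


# Random stand-ins with the same shape as the features printed above.
torch.manual_seed(0)
q_feats = torch.randn(2, 32, 128)   # 2 queries
p_feats = torch.randn(5, 32, 128)   # 5 candidate passages
scores = late_interaction_scores(q_feats, p_feats)
print(scores.shape)                 # torch.Size([2, 5])
ranking = scores.argsort(dim=-1, descending=True)  # best passage first, per query
```

Because queries and passages are encoded independently, passage token sets can be computed once and cached, and only this lightweight scoring step runs at query time.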

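📚 Documentation

Model sources

Repository: https://github.com/aimagelab/ReT
Paper: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval (CVPR 2025)

Dataset

Training and evaluation use a custom version of the M2KR benchmark: MSMARCO is excluded because it contains no images, and images are added to the documents from OVEN, InfoSeek, E-VQA, and OKVQA.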

Model information

Property	Details
Library name	transformers
Model type	Visual document retrieval
Base model	laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Training data	aimagelab/ReT-M2KR
License	apache-2.0

📄 License

This model is released under the Apache 2.0 license.

📚 Citation

If you use this model, please cite the following paper:

@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}