ReT-CLIP-ViT-L-14 open-source model - multimodal queries for fine-grained document retrieval

Ret CLIP ViT L 14

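Developed by aimagelab

ReT is a novel approach to multimodal document retrieval that supports multimodal queries and documents. It achieves fine-grained retrieval by fusing multi-level representations from the visual and textual backbones.

Multimodal Fusion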

Transformers

License: Apache-2.0 #MultimodalDocumentRetrieval #RecurrenceEnhancedTransformer #CrossLayerFeatureFusion

Downloads: 523

Release date: 3/25/2025

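Model Overview

ReT uses a Transformer-based recurrent cell with sigmoid gates to selectively control the flow of information across layers and modalities. It processes multimodal queries and documents independently and produces sets of latent tokens that are used for similarity computation.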

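Model Features

Multi-level feature fusion

Uses multi-level representations from the visual and textual backbones rather than only the final-layer features

Recurrent gating mechanism

LSTM-inspired sigmoid gates dynamically control the flow of cross-modal information

Independent multimodal processing

Handles both image and text content in queries and in documents

Fine-grained similarity computation

Produces sets of latent tokens that support fine-grained, late-interaction similarity matching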

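Model Capabilities

Multimodal document retrieval

Joint image-text representation

Cross-modal similarity computation

Vision-language feature fusion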

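Use Cases

Information retrieval

Cross-modal knowledge retrieval

Retrieves documents containing relevant answers from queries that mix images and text

Effectiveness validated on a custom version of the M2KR benchmark

Question answering systems

Visual question answering support

Retrieves documents that contain the answer to a question together with the corresponding image, for use by VQA systems

Supports visual question answering scenarios such as OKVQA and E-VQA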

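🚀 ReT: a Visual Document Retrieval Model

ReT is a novel approach to multimodal document retrieval that supports multimodal queries and documents. Existing methods use only the last-layer features of the vision and language backbones; ReT instead uses a Transformer-based recurrent cell to exploit multi-level representations from different layers of both the visual and textual backbones. Inspired by the design of LSTMs, the cell has sigmoid gates that selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently, producing sets of latent tokens used for fine-grained late-interaction similarity computation, and it can handle images and text in both queries and documents.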
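
To make the gating idea concrete, here is a minimal NumPy sketch of an LSTM-style sigmoid-gated update applied once per backbone layer. It is not the authors' implementation: the function `gated_step`, the parameter names, the random features, and the assumption that each layer's text and visual features are already projected to the same shape as the latent state are all hypothetical. In the real model the recurrent cell is Transformer-based, which this sketch leaves out.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function; every output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(state, text_feat, vis_feat, params):
    """One recurrent step: merge one layer's text and visual features into the state.

    All three inputs are assumed to have shape (num_tokens, dim).
    """
    forget = sigmoid(state @ params["w_forget"])    # how much of the previous state to keep
    in_text = sigmoid(text_feat @ params["w_text"])  # how much textual information to admit
    in_vis = sigmoid(vis_feat @ params["w_vis"])     # how much visual information to admit
    return forget * state + in_text * text_feat + in_vis * vis_feat

rng = np.random.default_rng(0)
num_layers, num_tokens, dim = 3, 32, 128  # hypothetical sizes for illustration

# The same gate weights are reused at every step (an assumption of this sketch).
params = {
    name: rng.normal(scale=0.02, size=(dim, dim))
    for name in ("w_forget", "w_text", "w_vis")
}

state = np.zeros((num_tokens, dim))
for layer in range(num_layers):
    # Random stand-ins for the features a backbone would expose at this layer.
    text_feat = rng.normal(size=(num_tokens, dim))
    vis_feat = rng.normal(size=(num_tokens, dim))
    state = gated_step(state, text_feat, vis_feat, params)

print(state.shape)  # (32, 128)
```

Because each gate is a sigmoid, it scales its input by a factor between 0 and 1, so the state can retain information from shallow layers while admitting more or less of each modality at deeper ones.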

🚀 Quick Start

Environment setup

Follow the instructions in the repository to install the required environment.

Usage example

from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# Query side
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])


# Passage (document) side
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
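
Each `get_ret_features` call above returns one set of 32 latent tokens of dimension 128 per input. As a minimal sketch of how two such token sets could be compared, the code below assumes a ColBERT-style MaxSim score, a common form of late interaction: each query token is matched to its most similar document token and the maxima are summed. The helpers `late_interaction_score` and `rank_documents` are hypothetical and not part of the ReT repository, whose scoring may differ (for example in normalization). Small toy arrays stand in for real features.

```python
import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim score between a (Nq, dim) query token set and a (Nd, dim) document token set."""
    sim = query_tokens @ doc_tokens.T    # (Nq, Nd) matrix of dot products
    return float(sim.max(axis=1).sum())  # best document token per query token, summed

def rank_documents(query_tokens, docs):
    """Return document indices sorted by decreasing score, plus the raw scores."""
    scores = [late_interaction_score(query_tokens, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy data: 2 query tokens and two candidate documents in a 2-dimensional space.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
doc_b = np.array([[0.5, 0.5], [0.2, 0.1]])

order, scores = rank_documents(query, [doc_b, doc_a])
print(order, scores)  # [1, 0] [1.0, 2.0]
```

To score real features, a PyTorch tensor of shape [1, 32, 128] can be converted first, for example with `ret_feats[0].detach().cpu().numpy()`.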

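✨ Key Features

Multimodal support: supports multimodal queries and documents, handling images and text together.
Multi-level feature use: a Transformer-based recurrent cell exploits multi-level representations from different layers of the visual and textual backbones.
Information flow control: LSTM-inspired sigmoid gates selectively control the flow of information between layers and modalities.
Fine-grained interaction: multimodal queries and documents are processed independently, producing sets of latent tokens for fine-grained late-interaction similarity computation.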

📚 Documentation

Model sources

Repository: https://github.com/aimagelab/ReT
Paper: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval (CVPR 2025)

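Training and evaluation

The model was trained and evaluated on a custom version of the challenging M2KR benchmark, with the following modifications: MSMARCO was excluded because it contains no images, and images were added to the documents of OVEN, InfoSeek, E-VQA, and OKVQA.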

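📄 License

This project is released under the Apache-2.0 license.

📝 Citation

If you use this model in your research, please cite it with the following BibTeX entry: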

@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}