nomic-embed-multimodal-7b: A Free, Open-Source Multimodal Embedding Model for Efficient Visual Document Retrieval

Nomic Embed Multimodal 7b

Developed by nomic-ai

A 7-billion-parameter multimodal embedding model specialized for visual document retrieval, with strong results on the Vidore-v2 benchmark

Safetensors

Multilingual · License: Apache-2.0 · #unified-text-image-encoding #visual-document-retrieval #multilingual-embedding

Downloads: 741

Released: 3/29/2025

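Model Overview

A high-performing dense multimodal embedding model that encodes interleaved text and images directly, without complex preprocessing, which makes it well suited to visual document retrieval

Model Features

Strong performance

Achieves 58.8 NDCG@5 on the Vidore-v2 benchmark, ahead of all other dense multimodal embedding models

Unified text-image encoding

Encodes interleaved text and images directly, with no complex preprocessing

Advanced architecture

A multimodal embedding model with 7 billion parameters

Fully open source

Model weights, training data, and complete code are all released

Model Capabilities

Visual document retrieval

Multimodal embedding

Multilingual processing

Unified text-image encoding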

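Use Cases

Research

Research paper retrieval

Handles research papers containing equations, charts, and data

Retrieves complex academic content effectively

Technical documentation

Technical documentation management

Encodes code blocks, flowcharts, screenshots, and other technical content

Improves retrieval efficiency for technical documentation

Business applications

Product catalog retrieval

Represents product images, specifications, and price lists

Improves the e-commerce experience

Financial report analysis

Embeds trend charts, bar charts, and numerical data

Speeds up financial data analysis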

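🚀 Nomic Embed Multimodal 7B: An Advanced Visual Document Retrieval Model

nomic-embed-multimodal-7b is an advanced dense multimodal embedding model that performs strongly on visual document retrieval:

High performance: achieves 58.8 NDCG@5 on Vidore-v2, ahead of all other dense multimodal embedding models.
Unified text-image encoding: encodes interleaved text and images directly, with no complex preprocessing.
Advanced architecture: a multimodal embedding model with 7 billion parameters.
Fully open source: model weights, training data, and code are all publicly available.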
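
For readers unfamiliar with the NDCG@5 metric quoted above, here is a minimal sketch of how it can be computed for a single query. It assumes the common linear-gain formulation (relevance divided by a log2 rank discount); the relevance labels are hypothetical, and the benchmark's own evaluation code may differ in details. Benchmark tables typically report the mean over all queries on a 0-100 scale.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for one query's ranked pages (1 = relevant).
ranked_relevance = [0, 1, 0, 0, 1, 0]
score = ndcg_at_k(ranked_relevance, k=5)
print(f"NDCG@5 = {score:.3f}")
```

A perfect ranking (all relevant pages first) yields 1.0, and a query with no relevant pages is scored 0.0 in this sketch.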

🚀 Quick Start

To use nomic-embed-multimodal-7b, install colpali from source:

pip install git+https://github.com/illuin-tech/colpali.git

import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import BiQwen2_5, BiQwen2_5_Processor

model_name = "nomic-ai/nomic-embed-multimodal-7b"

model = BiQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = BiQwen2_5_Processor.from_pretrained(model_name)

# Inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score(list(torch.unbind(query_embeddings)), list(torch.unbind(image_embeddings)))
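
The snippet above stops at the score matrix. As a rough sketch of the next step, the helper below ranks images for each query. It assumes `scores` is a 2-D tensor of shape (number of queries, number of images) that can be converted with `scores.tolist()`; `rank_images` and the example values are hypothetical and not part of colpali.

```python
def rank_images(score_matrix, top_k=1):
    """For each query row, return (image_index, score) pairs sorted by descending score."""
    rankings = []
    for row in score_matrix:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append([(i, row[i]) for i in order[:top_k]])
    return rankings

# Hypothetical scores for 2 queries x 2 images; with the snippet above this
# would be `scores.tolist()`, assuming `scores` is a 2-D tensor.
example_scores = [[0.12, 0.47], [0.33, 0.08]]
best = rank_images(example_scores, top_k=1)
print(best)
```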

✨ Key Features

Performance

Model	Avg.	ESG Restaurant Human	Econ Macro Multi.	AXA Multi.	MIT Bio	ESG Restaurant Synth.	ESG Restaurant Synth. Multi.	MIT Bio Multi.	AXA	Econ Macro
ColNomic Embed Multimodal 7B	62.7	73.9	54.7	61.3	66.1	57.3	56.7	64.2	68.3	61.6
ColNomic Embed Multimodal 3B	61.2	65.8	55.4	61.0	63.5	56.6	57.2	62.5	68.8	60.2
T-Systems ColQwen2.5-3B	59.9	72.1	51.2	60.0	65.3	51.7	53.3	61.7	69.3	54.8
Nomic Embed Multimodal 7B	59.7	65.7	57.7	59.3	64.0	49.2	51.9	61.2	66.3	63.1
GME Qwen2 7B	59.0	65.8	56.2	55.4	64.0	54.3	56.7	55.1	60.7	62.9
Nomic Embed Multimodal 3B	58.8	59.8	57.5	58.8	62.5	49.4	49.4	58.6	69.6	63.5
Llama Index vdr-2b-multi-v1	58.4	63.1	52.8	61.0	60.6	50.3	51.2	56.9	68.8	61.2
Voyage Multimodal 3	55.0	56.1	55.0	59.5	56.4	47.2	46.2	51.5	64.1	58.8

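Model Architecture

Total parameters: 7B
Training approach: fine-tuned from Qwen2.5-VL 7B Instruct
Architecture type: vision-language model with unified processing of text and image inputs
Key innovations:
- Same-source sampling to create harder in-batch negatives
- Hard negative mining with positive-aware techniques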
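
To make the same-source sampling idea concrete, here is a minimal sketch of a batcher in which every batch is drawn from a single dataset source, so in-batch negatives come from similar-looking material. The `source` field, the source names, and the `same_source_batches` helper are illustrative assumptions, not the actual training code.

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size, seed=0):
    """Return shuffled batches in which every example shares one `source` value."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)
    batches = []
    for items in by_source.values():
        rng.shuffle(items)
        # Drop the incomplete tail so every batch has the same size.
        for start in range(0, len(items) - batch_size + 1, batch_size):
            batches.append(items[start:start + batch_size])
    rng.shuffle(batches)  # mix sources across training steps, not within a batch
    return batches

# Hypothetical training examples tagged with the dataset they came from.
examples = (
    [{"source": "arxiv", "id": i} for i in range(5)]
    + [{"source": "reports", "id": i} for i in range(4)]
)
batches = same_source_batches(examples, batch_size=2)
```

Dropping the incomplete tail is one simple choice; padding or merging leftovers would be alternatives.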

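Integration with RAG Workflows

Nomic Embed Multimodal 7B integrates into retrieval-augmented generation (RAG) workflows:

Direct document embedding: embed document page images directly, skipping OCR and complex processing.
Faster processing: removing preprocessing steps speeds up indexing.
More complete information: a single embedding captures both textual and visual cues.
Simple implementation: the same API handles text and images.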
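
The retrieval step of such a workflow can be sketched as a nearest-neighbor lookup over page embeddings. The example below uses toy 3-dimensional vectors and plain cosine similarity; `page_index`, the page ids, and `retrieve_pages` are hypothetical stand-ins, and a real system would use the model's embeddings and most likely a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_pages(query_vec, page_index, top_k=2):
    """Return the ids of the top_k pages most similar to the query embedding."""
    scored = [(cosine(query_vec, vec), page_id) for page_id, vec in page_index.items()]
    scored.sort(reverse=True)
    return [page_id for _, page_id in scored[:top_k]]

# Toy 3-dimensional vectors standing in for real page and query embeddings.
page_index = {
    "report.pdf#p1": [0.9, 0.1, 0.0],
    "report.pdf#p2": [0.1, 0.9, 0.2],
    "manual.pdf#p7": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.1]
hits = retrieve_pages(query_vec, page_index, top_k=2)
print(hits)
```

The retrieved page images would then be passed, together with the question, to a generator model of the reader's choice.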

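Training Details

Nomic Embed Multimodal 7B was developed with several key innovations:

Same-source sampling: forcing each batch to be sampled from a single dataset source creates harder in-batch negatives and keeps the model from learning dataset-specific artifacts.
Hard negative mining: an initial model retrieves the top-k nearest neighbors for each query, and these hard negatives are then incorporated into training.
Positive-aware hard negative mining: techniques introduced in NV-Retriever are used to reduce false negatives.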
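
The mining steps above can be sketched roughly as follows. This example assumes a percentage-of-positive filter: candidates scoring above a fraction of the positive's score are treated as likely false negatives and skipped. The 0.95 ratio, the scores, and the `mine_hard_negatives` helper are assumptions for illustration; the source does not state which variant or threshold was actually used.

```python
def mine_hard_negatives(scores, positive_id, top_k=3, max_ratio=0.95):
    """Pick the top_k highest-scoring non-positive candidates, skipping any
    whose score exceeds max_ratio * positive score (likely false negatives)."""
    threshold = max_ratio * scores[positive_id]
    candidates = [
        (score, doc_id)
        for doc_id, score in scores.items()
        if doc_id != positive_id and score <= threshold
    ]
    candidates.sort(reverse=True)
    return [doc_id for _, doc_id in candidates[:top_k]]

# Hypothetical similarity scores from an initial model for one query.
scores = {"pos": 0.80, "d1": 0.79, "d2": 0.70, "d3": 0.55, "d4": 0.20}
negatives = mine_hard_negatives(scores, positive_id="pos", top_k=2)
print(negatives)
```

Here `d1` scores almost as high as the labeled positive, so it is excluded rather than used as a negative.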

🔧 Technical Details

Basic Model Information

Property	Details
Base model	Qwen/Qwen2.5-VL-7B-Instruct
Library	peft
Dataset	nomic-ai/colpali-queries-mined-20250321-by-source
Supported languages	English, Italian, French, German, Spanish
Task type	Visual document retrieval
Tags	vidore, colpali, multimodal_embedding, multilingual_embedding, Text-to-Visual Document (T→VD) retrieval

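📄 License

This project is licensed under Apache 2.0.

⚠️ Limitations

Performance may vary on documents with unconventional layouts or unusual visual elements.
Multiple languages are supported, but performance is strongest on English content.
Very large or complex documents may need to be split into smaller chunks.
Performance may degrade on documents containing handwriting or heavily stylized fonts.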
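
One simple reading of the splitting advice above is to treat a long document as a list of pages and process it in small consecutive groups. The sketch below assumes that interpretation; `chunk_pages` and the page ids are hypothetical, and the right chunk size depends on available memory.

```python
def chunk_pages(pages, chunk_size):
    """Split a list of pages into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]

# Hypothetical page identifiers for a 7-page document.
page_ids = [f"page-{n}" for n in range(1, 8)]
chunks = chunk_pages(page_ids, chunk_size=3)
print(chunks)
```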

👥 Join the Nomic Community

Nomic Embed ecosystem: https://www.nomic.ai/embed
Website: https://nomic.ai
Twitter: https://twitter.com/nomic_ai
Discord: https://discord.gg/myY5YDR8z8

📚 Citation

If you find this model useful in your research or applications, please consider citing:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@misc{ma2024unifyingmultimodalretrievaldocument,
      title={Unifying Multimodal Retrieval via Document Screenshot Embedding}, 
      author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
      year={2024},
      eprint={2406.11251},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2406.11251}, 
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}