Nomic Embed Multimodal 3B: an open-weights model for visual document retrieval


Nomic Embed Multimodal 3B

Developed by nomic-ai

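Nomic Embed Multimodal 3B is a state-of-the-art multimodal embedding model for visual document retrieval. It encodes text and images with a single unified encoder and reaches 58.8 NDCG@5 on Vidore-v2.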


Safetensors

Supports multiple languages. Tags: #visual-document-retrieval #multimodal-embedding #multilingual

Downloads: 3,431

Released: March 27, 2025

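Model overview

A 3-billion-parameter multimodal embedding model built for visual document retrieval. It encodes interleaved text and images directly, with no complex preprocessing.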

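Model features

Strong performance

Reaches 58.8 NDCG@5 on Vidore-v2, outperforming all other dense multimodal embedding models of similar size.

Unified text-image encoding

Encodes interleaved text and images directly, with no complex preprocessing.

Advanced training methods

Trained with same-source sampling and positive-aware hard negative mining.

Multilingual support

Supports English, Italian, French, German, and Spanish.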

Model capabilities

Visual document retrieval

Multimodal embedding

Joint text-image encoding

Multilingual document processing

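Use cases

Research

Academic paper retrieval

Captures the equations, figures, and data tables in papers.

Improves retrieval accuracy for academic content.

Enterprise applications

Technical documentation management

Encodes the code blocks, flowcharts, and screenshots in technical documentation.

Improves retrieval efficiency for technical documentation.

Financial report analysis

Embeds the trend charts, statistical charts, and numerical data in financial reports.

Improves retrieval quality for financial data.

E-commerce

Product catalog retrieval

Handles product images, specifications, and price tables.

Improves the product search experience.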

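🚀 Nomic Embed Multimodal 3B: a state-of-the-art visual document retrieval model

nomic-embed-multimodal-3b is a state-of-the-art dense multimodal embedding model that excels at visual document retrieval:

High performance: achieves 58.8 NDCG@5 on Vidore-v2, outperforming all other dense multimodal embedding models of similar size.
Unified text-image encoding: encodes interleaved text and images directly, with no complex preprocessing.
Advanced architecture: a 3B-parameter multimodal embedding model.
Open weights: the model weights are available for research use.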

🚀 Quick start

To use nomic-embed-multimodal-3b, install colpali from source:

pip install git+https://github.com/illuin-tech/colpali.git

import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import BiQwen2_5, BiQwen2_5_Processor

model_name = "nomic-ai/nomic-embed-multimodal-3b"

model = BiQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = BiQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Score each query embedding against each image embedding (one row per query)
scores = processor.score(list(torch.unbind(query_embeddings)), list(torch.unbind(image_embeddings)))
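Once the scores are computed, each query's row can be sorted to rank the candidate pages. The following is a minimal sketch, assuming the score matrix has one row per query and one column per image and has been converted to nested Python lists (for example with `scores.tolist()`). The helper name `rank_pages` and the toy values are hypothetical, not part of colpali or of the model's output.

```python
from typing import List, Tuple


def rank_pages(scores: List[List[float]], top_k: int = 3) -> List[List[Tuple[int, float]]]:
    """For each query row, return the top_k (image_index, score) pairs, best first."""
    rankings = []
    for row in scores:
        # Sort image indices by descending similarity for this query.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append([(i, row[i]) for i in order[:top_k]])
    return rankings


# Toy score matrix: 2 queries x 3 images (illustrative values only).
toy_scores = [
    [0.12, 0.80, 0.33],
    [0.55, 0.10, 0.41],
]
top = rank_pages(toy_scores, top_k=2)
for query_index, hits in enumerate(top):
    print(query_index, hits)
```

Keeping the ranking step separate from the forward pass makes it easy to swap in an approximate nearest-neighbor index later without touching the embedding code.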

✨ Key features

Performance (NDCG@5 on Vidore-v2)

Model	Avg.	ESG Restaurant Human	Econ Macro Multi.	AXA Multi.	MIT Bio	ESG Restaurant Synth.	ESG Restaurant Synth. Multi.	MIT Bio Multi.	AXA	Econ Macro
ColNomic Embed Multimodal 7B	62.7	73.9	54.7	61.3	66.1	57.3	56.7	64.2	68.3	61.6
ColNomic Embed Multimodal 3B	61.2	65.8	55.4	61.0	63.5	56.6	57.2	62.5	68.8	60.2
T-Systems ColQwen2.5-3B	59.9	72.1	51.2	60.0	65.3	51.7	53.3	61.7	69.3	54.8
Nomic Embed Multimodal 7B	59.7	65.7	57.7	59.3	64.0	49.2	51.9	61.2	66.3	63.1
GME Qwen2 7B	59.0	65.8	56.2	55.4	64.0	54.3	56.7	55.1	60.7	62.9
Nomic Embed Multimodal 3B	58.8	59.8	57.5	58.8	62.5	49.4	49.4	58.6	69.6	63.5
Llama Index vdr-2b-multi-v1	58.4	63.1	52.8	61.0	60.6	50.3	51.2	56.9	68.8	61.2
Voyage Multimodal 3	55.0	56.1	55.0	59.5	56.4	47.2	46.2	51.5	64.1	58.8
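For readers unfamiliar with the metric, NDCG@5 compares the discounted gain of the top five retrieved items with the gain of an ideal ordering. The sketch below is one common formulation (linear gain, log2 discount) and is not taken from the Vidore-v2 evaluation code. It assumes every relevant item appears somewhere in the ranked list, and the function names are hypothetical. The figures in the table appear to be reported on a 0 to 100 scale, whereas this function returns a value between 0 and 1.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 5) -> float:
    """Normalize DCG by the DCG of the ideal (descending relevance) ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels of the retrieved pages, in the order the retriever returned them.
example = [0, 1, 0, 0, 1]
print(round(ndcg_at_k(example, k=5), 4))
```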

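Model architecture

Total parameters: 3B
Training approach: fine-tuned from Qwen2.5-VL 3B Instruct
Architecture type: vision-language model with unified text and image input processing
Key innovations:
- Same-source sampling to create harder in-batch negatives
- Hard negative mining with positive-aware techniques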
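Same-source sampling can be pictured as a batching rule: every training batch is drawn from a single dataset source, so the in-batch negatives share domain and layout with the positive. The sketch below is an illustration under that reading, not Nomic's training code. The function `same_source_batches` and the source labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence


def same_source_batches(
    sources: Sequence[Hashable], batch_size: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield batches of example indices where each batch comes from one source only."""
    rng = random.Random(seed)
    by_source: Dict[Hashable, List[int]] = defaultdict(list)
    for index, source in enumerate(sources):
        by_source[source].append(index)

    batches: List[List[int]] = []
    for indices in by_source.values():
        rng.shuffle(indices)
        # Slice each source's examples into batches; the last one may be smaller.
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    rng.shuffle(batches)  # interleave sources across training steps
    yield from batches


# Source label of each training example (illustrative).
example_sources = ["arxiv", "arxiv", "finance", "arxiv", "finance", "web"]
example_batches = list(same_source_batches(example_sources, batch_size=2))
print(example_batches)
```

A production sampler would also need a policy for undersized final batches, such as dropping or padding them. This sketch simply keeps them.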

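Integration with RAG workflows

Nomic Embed Multimodal 3B integrates into retrieval-augmented generation (RAG) workflows:

Direct document embedding: embed document page images directly, skipping OCR and complex processing.
Faster processing: removing preprocessing steps makes indexing faster.
More complete information: a single embedding captures both textual and visual cues.
Simple implementation: the same API is used for text and images.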
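The retrieval step of such a workflow can be sketched as an in-memory index that stores one embedding per page and returns the closest pages for a query embedding. This is a minimal illustration, assuming cosine similarity and toy three-dimensional vectors in place of real model embeddings. The `PageIndex` class and the page identifiers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class PageIndex:
    """Brute-force index mapping page identifiers to embeddings."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, List[float]]] = []

    def add(self, page_id: str, embedding: Sequence[float]) -> None:
        self._entries.append((page_id, list(embedding)))

    def search(self, query_embedding: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        scored = [(pid, cosine_similarity(query_embedding, emb)) for pid, emb in self._entries]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


# Toy vectors stand in for page-image embeddings produced by the model.
index = PageIndex()
index.add("report.pdf#p1", [1.0, 0.0, 0.0])
index.add("report.pdf#p2", [0.0, 1.0, 0.0])
index.add("manual.pdf#p7", [0.7, 0.7, 0.0])

hits = index.search([0.9, 0.1, 0.0], top_k=2)
print(hits)  # the retrieved pages would then be passed to the generator as context
```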

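Training details

Nomic Embed Multimodal 3B was developed with several key innovations:

Same-source sampling: forcing each batch to be sampled from the same dataset source creates harder in-batch negatives and prevents the model from learning dataset artifacts.
Hard negative mining: an initial model retrieves the top-k nearest neighbors for each query, and these hard negatives are then added to training.
Positive-aware hard negative mining: techniques introduced in NV-Retriever are used to reduce false negatives.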
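The two mining steps above can be combined in a small filter: rank the candidates retrieved by the initial model, then discard any candidate whose score is too close to the positive's score, since it may be an unlabeled positive. The sketch below is one way to express that idea and is not the authors' pipeline. The function name, the `max_ratio` parameter, and its 0.95 default are illustrative assumptions, and the filter assumes positive similarity scores.

```python
from typing import Dict, List, Set


def mine_hard_negatives(
    candidate_scores: Dict[str, float],
    positive_ids: Set[str],
    positive_score: float,
    top_k: int = 4,
    max_ratio: float = 0.95,
) -> List[str]:
    """Pick the top_k highest-scoring candidates that are unlikely to be false negatives.

    candidate_scores maps document ids to their similarity with the query,
    as computed by an initial retrieval model.
    """
    threshold = max_ratio * positive_score
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    negatives: List[str] = []
    for doc_id, score in ranked:
        if doc_id in positive_ids:
            continue  # never use a labeled positive as a negative
        if score > threshold:
            continue  # scores almost as high as the positive: likely a false negative
        negatives.append(doc_id)
        if len(negatives) == top_k:
            break
    return negatives


# Similarities from a hypothetical initial model; page_a is the labeled positive.
retrieved = {"page_a": 0.90, "page_b": 0.89, "page_c": 0.70, "page_d": 0.50, "page_e": 0.20}
hard_negatives = mine_hard_negatives(retrieved, {"page_a"}, positive_score=0.90, top_k=2)
print(hard_negatives)
```

Here `page_b` is skipped because its score exceeds 95% of the positive's score, so the two negatives kept are the next-hardest candidates.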

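Limitations

Performance may vary on documents with unconventional layouts or unusual visual elements.
The model handles multiple languages, but performance is strongest on English content.
Very large or complex documents may need to be split into smaller chunks.
Performance may degrade on documents containing handwriting or heavily stylized fonts.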
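One simple way to handle the large-document case is to treat a document as a sequence of pages and embed it in fixed-size groups. The helper below is a sketch of that splitting step only. The name `chunk_pages` and the chunk size are hypothetical, and the source does not specify a recommended chunk size.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_pages(pages: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Split a sequence of pages into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [list(pages[start:start + chunk_size]) for start in range(0, len(pages), chunk_size)]


# Page identifiers stand in for page images; each chunk would be embedded as one batch.
page_ids = [f"page_{n}" for n in range(1, 8)]
chunks = chunk_pages(page_ids, chunk_size=3)
print(chunks)
```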

📄 License

Citation

If you find this model useful in your research or applications, please consider citing:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@misc{ma2024unifyingmultimodalretrievaldocument,
      title={Unifying Multimodal Retrieval via Document Screenshot Embedding}, 
      author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
      year={2024},
      eprint={2406.11251},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2406.11251}, 
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}