OmniEmbed-v0.1: Open-Source Multimodal Embedding Model with Unified Representations for Multilingual Text, Audio, and Video

OmniEmbed-v0.1

Developed by Tevatron

A multimodal embedding model built on Qwen2.5-Omni-7B that produces unified embeddings for multilingual text, images, audio, and video

Multimodal fusion

Safetensors

License: MIT #multimodal-embedding #cross-modal-retrieval #unified-document-retrieval

Downloads: 2,190

Release date: 4/12/2025

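Model Overview

OmniEmbed is a multimodal embedding model that produces unified embeddings for multilingual text, images, audio, and video, enabling efficient cross-modal retrieval across a wide range of applications.

Model Features

Unified multimodal embeddings

Text, images, audio, and video are embedded into a single shared space, which makes cross-modal retrieval possible

Multilingual capability

Supports multilingual text retrieval, with performance close to a dedicated multilingual retriever (nDCG@10 of 69.1 on MIRACL versus 69.2 for BGE-M3)

High-performance retrieval

Scores on text, image-document, video, and audio benchmarks are comparable to or better than those of specialized single-modality baselines

Open training

The training data and training code are fully open-sourced through Tevatron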

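Model Capabilities

Text retrieval

Image-document retrieval

Video retrieval

Audio retrieval

Multilingual retrieval

Use Cases

Multimedia retrieval

Video retrieval

Retrieve relevant video content from a text query

Reaches R@1 of 51.3 on MSRVTT, ahead of the CLIP baseline (31.2)

Audio retrieval

Retrieve relevant audio clips from a text description

Reaches R@1 of 34.0 on AudioCaps, ahead of the CE baseline (23.1)

Document retrieval

Image-document retrieval

Retrieve relevant information from documents that contain images or charts

Reaches nDCG@5 of 85.8 on VIDORE

Multilingual retrieval

Cross-lingual text retrieval

Reaches nDCG@10 of 69.1 on MIRACL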

🚀 Tevatron/OmniEmbed-v0.1

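OmniEmbed is a multimodal embedding model built on Qwen2.5-Omni-7B and trained with our Tevatron toolkit, a unified toolkit for document retrieval across scale, languages, and modalities. OmniEmbed generates unified embeddings for multilingual text, images, audio, and video, which enables effective cross-modal retrieval for a variety of applications.

📝 Text 🖼️ Image 🎧 Audio 🎥 Video 🌐 Multilingual

✨ Key Features

Built on the Qwen2.5-Omni-7B model.
Trained with Tevatron, a unified toolkit for document retrieval across scale, languages, and modalities.
Generates unified embeddings for multilingual text, images, audio, and video.
Supports effective cross-modal retrieval for a variety of applications.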

💻 Usage Examples

Basic Usage

# Import Library, Load Model and Processor
import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

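# Note: the processor comes from the base Qwen2.5-Omni-7B repository, and the
# checkpoint ID below is kept exactly as given in the source.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    'ArvinZhuang/OmniEmbed-test',
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
    torch_dtype=torch.bfloat16
).to(device).eval()

# Pad on the left so the last position of every sequence is a real token,
# which is the position encode_message pools from
processor.tokenizer.padding_side = "left"
model.padding_side = "left"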

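# Encode one chat-style message (text, image, audio, or video) into a unit-length embedding
@torch.no_grad()  # inference only, so gradient tracking is not needed
def encode_message(message):
    # Render the chat template as a string and append the end-of-text token;
    # the hidden state at that final token becomes the embedding
    texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)[0] + "<|endoftext|>"
    # Load any audio, image, or video inputs referenced in the message
    audio_inputs, image_inputs, video_inputs = process_mm_info(message, use_audio_in_video=True)

    inputs = processor(
        text=texts,
        audio=audio_inputs,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
        padding="longest",
    )
    # Move every input tensor to the model's device
    for k in inputs:
        inputs[k] = inputs[k].to(device)

    # Run a single forward pass and keep the hidden states of every layer
    cache_position = torch.arange(0, inputs['input_ids'].shape[1], device=device)
    inputs = model.prepare_inputs_for_generation(**inputs, use_cache=True, cache_position=cache_position)
    model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

    # Last-token pooling on the final layer, followed by L2 normalization
    last_hidden_state = model_outputs.hidden_states[-1]
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps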
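
The pooling step in encode_message takes the hidden state at the last position and L2-normalizes it, which is why the setup code sets padding_side to "left". The sketch below illustrates that interaction with plain Python lists rather than model outputs. The 2-d vectors, the zero PAD vector, and the helper names l2_normalize and last_token_pool are illustrative assumptions, not part of the model or of Tevatron; in a real model the hidden state at a padding position is not zero, only uninformative.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (the same operation as normalize with p=2)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def last_token_pool(hidden_states):
    """Pick the hidden state at the final position of each sequence."""
    return [seq[-1] for seq in hidden_states]

# Hypothetical stand-in for the hidden state at a padding position.
PAD = [0.0, 0.0]

# Two toy sequences of 2-d "hidden states"; the second is one token shorter.
seq_a = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
seq_b = [[5.0, 0.0], [0.0, 5.0]]

left_padded = [seq_a, [PAD] + seq_b]    # padding in front: last position is a real token
right_padded = [seq_a, seq_b + [PAD]]   # padding at the end: last position is padding

left_reps = [l2_normalize(v) for v in last_token_pool(left_padded)]
right_reps = last_token_pool(right_padded)

print(left_reps)   # every row is pooled from a real token
print(right_reps)  # the second row is the padding vector
```

With left padding every pooled row comes from the final real token of its sequence, so the same indexing works for a whole batch regardless of sequence length.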

Advanced Usage

🎬 Video Retrieval

example_query = 'Query: How to cook Mapo Tofu?'
example_video_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"
example_video_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/zhajiang_noodle.mp4"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
video_1 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_1}]}]
video_2 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(video_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(video_2))

print("Similarities:", sim1.item(), sim2.item())

🎵 Audio Retrieval

example_query = 'Query: A light piano piece'
example_audio_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/joe_hisaishi_summer.mp3"
example_audio_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/jay_chou_superman_cant_fly.mp3"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
audio_1 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_1}]}]
audio_2 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(audio_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(audio_2))

print("Similarities:", sim1.item(), sim2.item())

📈 Image-Document Retrieval (Images, Charts, PDFs)

example_query = 'Query: How many input modality does Qwen2.5-Omni support?'
example_image_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/qwen2.5omni_hgf.png"
example_image_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/llama4_hgf.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))

print("Similarities:", sim1.item(), sim2.item())

🌍 Multilingual Text Retrieval

example_query = 'Query: 氧气在空气中占比多少？'
example_text_1 = "空气是指大气层中由不同气体和各类飘浮在其中的固体与液体颗粒（大气颗粒与气溶胶）所组成的气态混合物。地球大气层的空气主要由78.1%的氮气、20.9%氧气、0.9%的氩气和1~4%的水蒸气组成，其成分并不是固定的，随着高度、气压、温度的改变和对流情况不同，局部空气的组成比例也会改变。空气在大气层（特别是对流层）中的流动形成了风和曳流、气旋、龙卷等自然现象，而空气中飘浮的颗粒则形成了云、雾、霾和沙尘暴等短期天气情况。空气在海洋和陆地之间跨区域流动所承载的湿度和热能传导也是水循环和气候变率与变化的关键一环。"
example_text_2 = "水（化学式：H2O）是一种无机化合物，在常温且无杂质中是无色[1]无味不导电的透明液体，也会通过蒸发产生气态的水蒸气（这种蒸发可以发生在任何温度下，同时取决于与空气接触的表面积和湿度差）。在标准大气压下，水的凝固点是0 °C（32 °F；273 K），沸点是100 °C（212 °F；373 K）。"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))

print("Similarities:", sim1.item(), sim2.item())
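
Each example above compares one query against two candidates. For a larger candidate set, the same comparison turns into a ranking problem. The following is a minimal sketch of that step, assuming the embeddings are already L2-normalized (as encode_message returns them), so the dot product equals cosine similarity. The helper name rank_candidates and the toy 2-d vectors are hypothetical and stand in for real model outputs.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_candidates(query_vec, candidate_vecs, top_k=None):
    """Return (index, score) pairs sorted by descending similarity.

    Assumes every vector is already unit-length, so the dot product
    is the cosine similarity.
    """
    scored = [(i, dot(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]

# Toy unit vectors standing in for embeddings of a query and three candidates.
query_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]

ranking = rank_candidates(query_vec, candidate_vecs, top_k=2)
print(ranking)  # candidate indices with their similarity scores, best first
```

To feed real embeddings into a sketch like this, one option is to convert each result with encode_message(msg)[0].float().tolist(); for large collections, stacking the embeddings into one tensor and using a matrix product is likely to be far more efficient.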

📚 Detailed Documentation

Evaluation Results

Benchmark	Task	Metric	OmniEmbed	Baseline (score)
BEIR-13	Text retrieval	nDCG@10	58.2	MistralE5 (59.0)
MIRACL	Multilingual retrieval	nDCG@10	69.1	BGE-M3 (69.2)
VIDORE	Image-document retrieval	nDCG@5	85.8	DSE-QWen2 (85.8)
MSRVTT	Video retrieval	R@1	51.3	CLIP (31.2)
AudioCaps	Audio retrieval	R@1	34.0	[CE](https://paperswithcode.com/sota/text-to-audio-retrieval-on-audiocaps) (23.1)
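
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute recall at k (R@k) and nDCG@k for a single query. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers: the function names and document IDs are hypothetical, nDCG uses linear gain with a log2 discount (some toolkits use an exponential gain instead), and the table values appear to be per-query scores averaged over the benchmark and scaled to 0-100.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k with linear gain; relevance maps doc id -> graded relevance."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# One toy query: the system ranked d3 first, but only d1 is relevant.
ranked_ids = ["d3", "d1", "d2"]
relevance = {"d1": 1}

r_at_1 = recall_at_k(ranked_ids, set(relevance), k=1)
r_at_2 = recall_at_k(ranked_ids, set(relevance), k=2)
ndcg_at_3 = ndcg_at_k(ranked_ids, relevance, k=3)
print(r_at_1, r_at_2, ndcg_at_3)
```

When each query has exactly one relevant item, R@1 reduces to the share of queries whose top-ranked result is the correct one.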