UniME-LLaVA-OneVision-7B open-source multimodal model: a practical choice for improving multimodal embeddings

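UniME LLaVA OneVision 7B

Developed by DeepGlint-AI

UniME is a general-purpose embedding learning framework built on multimodal large language models. It substantially improves multimodal embedding capability through textual discriminative knowledge distillation and hard negative enhanced instruction tuning.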

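Multimodal alignment

Transformers

Language: English · License: MIT · #multimodal-embedding-learning #textual-discriminative-distillation #hard-negative-enhancement

Downloads: 376

Released: 5/6/2025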

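Model overview

UniME aims to break the modality barrier. It strengthens the embedding capability of multimodal large language models through new training methods and performs strongly on the MMEB leaderboard.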

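Model features

Textual discriminative knowledge distillation

The LLM component is decoupled from the multimodal model and processes text with a prompt. The student and teacher embeddings are then aligned with a KL-divergence objective, and only the LLM component is fine-tuned.

Hard negative enhancement

A similarity-threshold filter removes false negatives, and the top-k similar but non-matching samples are selected automatically. Together these raise training difficulty and improve model performance.

Multimodal embedding optimization

The multimodal system is optimized by improving visual sensitivity, strengthening cross-modal alignment, and enhancing instruction following.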

Model capabilities

Multimodal embedding learning

Image-text understanding

Cross-modal retrieval

Text summarization

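Use cases

Information retrieval

Cross-modal retrieval

Retrieve relevant text descriptions for an image, or relevant images for a text query

Performs strongly on the MMEB benchmark

Content understanding

Image content summarization

Summarize the content of an image in a few concise words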

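🚀 Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

This project aims to break the modality barrier by using multimodal large language models for universal embedding learning. It achieves strong results on multimodal tasks, including a position near the top of the MMEB leaderboard.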

🚀 Quick start

Environment setup

git clone https://github.com/deepglint/UniME.git
cd UniME
conda create -n uniME python=3.10 -y
conda activate uniME
pip install -r requirements.txt
pip install transformers==4.49.0
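
Because the steps above pin transformers to 4.49.0, it can be useful to confirm that the active environment really has that version before loading the model. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` and the `PINNED` dictionary are illustrative and not part of the UniME repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pinned by the install steps above.
PINNED = {"transformers": "4.49.0"}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


problems = check_pins(PINNED)
for name, (expected, installed) in problems.items():
    print(f"{name}: expected {expected}, found {installed}")
if not problems:
    print("All pinned packages match.")
```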

Code example

import torch
from PIL import Image
from torch.nn import functional as F
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration


def build_conversation(image=None, text=None):
    """Wrap an image or a sentence in the one-word summary prompt."""
    if image is not None:
        return [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Summary above image in one word:\n"},
            ],
        }]
    if text is not None:
        return [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{text}\nSummary above sentence in one word:\n"},
            ],
        }]
    raise ValueError("Provide either an image or a text.")


base_model_path = "DeepGlint-AI/UniME-LLaVA-OneVision-7B"

text = "A man is crossing the street with a red car parked nearby."
image_path = "figures/demo.png"
input_image = [Image.open(image_path)]

transform = AutoProcessor.from_pretrained(base_model_path, trust_remote_code=True)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    base_model_path, device_map="cuda", trust_remote_code=True, torch_dtype=torch.float16
)
# Pad on the left so the last position of every sequence holds a real token.
transform.tokenizer.padding_side = "left"
transform.tokenizer.padding = True

inputs_text = transform.apply_chat_template([build_conversation(text=text)],
                                        add_generation_prompt=True,
                                        tokenize=True,
                                        return_dict=True,
                                        return_tensors="pt",
                                        padding=True).to("cuda")
inputs_image = transform.apply_chat_template([build_conversation(image=input_image)],
                                        add_generation_prompt=True,
                                        tokenize=True,
                                        return_dict=True,
                                        return_tensors="pt",
                                        padding=True).to("cuda")

with torch.no_grad():
    # Use the final layer's hidden state at the last token as the embedding.
    emb_text = model(**inputs_text, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    # L2-normalize so the dot product below is a cosine similarity.
    emb_text = F.normalize(emb_text, dim=-1)
    emb_image = F.normalize(emb_image, dim=-1)
    score = emb_image @ emb_text.T
print("Score: ", score.item())
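
The example above scores a single image against a single sentence. For retrieval, the same cosine similarity is computed against many candidates and the results are sorted. The sketch below illustrates that ranking step with plain Python lists so it runs without a GPU; the toy 3-dimensional vectors stand in for real embeddings, and `rank_candidates` is a hypothetical helper, not a function from the UniME code base.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_candidates(query, candidates):
    """Return (index, cosine similarity) pairs sorted from best to worst match."""
    q = l2_normalize(query)
    scores = []
    for i, cand in enumerate(candidates):
        c = l2_normalize(cand)
        scores.append((i, sum(a * b for a, b in zip(q, c))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for one query embedding and three candidate embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_candidates(query, candidates)
for index, similarity in ranking:
    print(index, round(similarity, 4))
```

With real embeddings held as tensors, the same ranking reduces to one matrix product between the normalized query and the normalized candidate matrix, followed by a sort.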

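✨ Key features

Textual discriminative knowledge distillation

To strengthen the embedding capability of multimodal large language models (MLLMs), we propose textual discriminative knowledge distillation. Training decouples the large language model (LLM) component of the MLLM and processes text with the prompt "Summary above sentence in one word". The embeddings of the student model (the MLLM) and the teacher model (NV-Embed V2) are then aligned via KL divergence over their batch-wise similarity distributions. Notably, only the LLM component is fine-tuned in this stage, while all other parameters remain frozen.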
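
As a rough illustration of aligning student and teacher through batch-wise similarity distributions, the sketch below computes a row-wise softmax over each model's in-batch cosine-similarity matrix and averages the KL divergence between corresponding rows. It is a dependency-free approximation, not the project's training code: the temperature value, the KL direction (teacher relative to student), and all function names are assumptions made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def softmax(row, temperature):
    """Numerically stable softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distributions(embeddings, temperature):
    """Row-wise softmax over the batch cosine-similarity matrix."""
    unit = [normalize(e) for e in embeddings]
    return [
        softmax([sum(a * b for a, b in zip(u, v)) for v in unit], temperature)
        for u in unit
    ]


def distillation_loss(student_emb, teacher_emb, temperature=0.05):
    """Mean KL(teacher || student) over rows; temperature is an assumed value."""
    p_teacher = similarity_distributions(teacher_emb, temperature)
    p_student = similarity_distributions(student_emb, temperature)
    total = 0.0
    for p_row, q_row in zip(p_teacher, p_student):
        total += sum(p * math.log(p / q) for p, q in zip(p_row, q_row) if p > 0.0)
    return total / len(p_teacher)


# Toy batch of three texts; teacher and student may use different dimensions.
teacher = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(distillation_loss(student, teacher))  # positive: the geometries disagree
print(distillation_loss(teacher, teacher))  # 0.0: identical distributions
```

One property this formulation makes visible: because only the N×N similarity structure of the batch is compared, the student and teacher embeddings do not need to share a dimension.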

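Hard negative enhanced instruction tuning

Next, we propose hard negative enhanced instruction tuning, which strengthens the multimodal system by improving visual sensitivity, reinforcing cross-modal alignment, and boosting instruction-following ability. It rests on two key innovations: a false negative filtering mechanism that uses a similarity threshold to eliminate misleading samples, and an automatic hard negative sampling strategy that selects the top-k similar but non-matching examples to increase training difficulty.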
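
The two steps described above, threshold-based false negative filtering followed by top-k hard negative selection, can be sketched for a single query as follows. The text does not state how the threshold is set, so this example assumes one plausible choice: the query-positive similarity plus a small margin. The function name, the `margin` parameter, and the toy similarity values are all hypothetical.

```python
def select_hard_negatives(query_sims, positive_index, k, margin):
    """Pick the k most similar non-matching candidates for one query.

    query_sims: similarity of the query to every candidate in the pool.
    positive_index: index of the ground-truth match.
    margin: assumed slack; candidates scoring above
            sim(query, positive) + margin are treated as false negatives.
    """
    threshold = query_sims[positive_index] + margin
    kept = [
        (i, s) for i, s in enumerate(query_sims)
        if i != positive_index and s <= threshold  # drop suspected false negatives
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest negatives first
    return [i for i, _ in kept[:k]]


# Toy similarities: candidate 0 is the labeled positive, and candidate 1 scores
# well above it, so it is filtered out as a likely false negative.
sims = [0.80, 0.95, 0.78, 0.30, 0.75, 0.79]
hard_negatives = select_hard_negatives(sims, positive_index=0, k=2, margin=0.1)
print(hard_negatives)
```

Filtering before the top-k step matters: without it, the highest-scoring "negatives" would include unlabeled true matches, and training against them would push correct pairs apart.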

📚 Detailed documentation

Diverse retrieval results

MMEB results

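📄 License

This project is released under the MIT License.

📚 Citation

If you find this repository useful, please cite it using the following BibTeX entry.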

@misc{gu2025breakingmodalitybarrieruniversal,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs}, 
      author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2504.17432},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.17432}, 
}

Project information

Property	Details
Model type	Image-text-to-text
Training data	TIGER-Lab/MMEB-train
Base model	llava-hf/llava-onevision-qwen2-7b-ov-hf
Evaluation metric	Recall
Library	transformers