E5rope-base: an open-source embedding model for long-context retrieval

E5rope Base

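Developed by dwzhu

E5-RoPE-Base is an embedding model built on rotary position embedding (RoPE) and designed to support long-context retrieval tasks.

Text Embedding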

Safetensors

Language: English | License: MIT | #long-context retrieval #rotary position embedding #sentence similarity

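Downloads: 129

Released: 4/18/2024

Model Overview

The model is mainly intended for sentence similarity and long-context retrieval. It uses rotary position embedding (RoPE) to improve its handling of long texts.

Model Features

Rotary Position Embedding (RoPE)

Uses rotary position embedding to handle long-context retrieval tasks effectively.

Efficient Retrieval

Improves the retrieval performance of the embedding model on long contexts.

Multi-Task Support

Supports multiple tasks, including sentence similarity and long-context retrieval.

Model Capabilities

Sentence similarity

Long-context retrieval

Text embedding generation

Use Cases

Information Retrieval

Query-Passage Matching

Matches queries with relevant passages to improve the accuracy of retrieval systems.

Reported to perform well on the BEIR and MTEB benchmarks; the Benchmark Evaluation section points to the code for reproducing the results.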
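
Query-passage matching usually comes down to ranking passages by the similarity of their embeddings to the query embedding. The following is a minimal sketch of that ranking step, assuming cosine similarity as the score. The helpers `cosine` and `rank_passages` are hypothetical names, and the 3-dimensional vectors are toy stand-ins for real model embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs, top_k=2):
    """Return (index, score) pairs for the top_k passages, best first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors (assumption): real embeddings would come from the model.
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]
ranking = rank_passages(query, passages)
print(ranking)
```

Here the passage at index 1 ranks first because it points in the same direction as the query, even though its length differs; cosine similarity ignores vector length.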

Semantic Similarity

Sentence Similarity

Computes the semantic similarity between two sentences.

🚀 E5-RoPE-Base

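E5-RoPE-Base is an embedding model for long-context retrieval. It is based on the paper LongEmbed: Extending Embedding Models for Long Context Retrieval and is intended to compare embedding models that use absolute position embedding (APE) with those that use rotary position embedding (RoPE), demonstrating the advantages of RoPE on long contexts.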
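
For readers unfamiliar with RoPE, the sketch below illustrates the core idea in pure Python: each consecutive pair of vector components is rotated by an angle proportional to the token position. It assumes the interleaved-pair formulation and the commonly used base of 10000. The names `rope_rotate` and `dot` are illustrative only, and the model's own implementation may organize the computation differently.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * theta_i."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)  # per-pair rotation frequency
        angle = position * theta
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Arbitrary toy query/key vectors (assumption, not model outputs).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(score_near, score_far)
```

The two scores should agree up to floating-point error: after rotation, the dot product depends only on the relative offset between the two positions, not on where they sit in the sequence. An absolute position embedding provides no such guarantee.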

🚀 Quick Start

The model has 12 layers and an embedding dimension of 768. Its usage is described below.

💻 Usage Example

Basic Usage

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."]
# Use a GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True)
model = AutoModel.from_pretrained('dwzhu/e5rope-base', trust_remote_code=True).to(device).eval()
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8)
batch_dict = {k: v.to(device) for k, v in batch_dict.items()}
# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
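
To make the `average_pool` step in the example above concrete, here is a toy pure-Python equivalent for a single sequence. `masked_mean` is a hypothetical helper written for illustration and is not part of the model's code; the token vectors are made up.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; skip padding (mask 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / count for t in total]

# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector has no effect on the result, which is why the attention mask is passed to the pooling function alongside the hidden states.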

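📚 Documentation

Training Details

For details on how the model was trained, please refer to our paper (https://arxiv.org/abs/2404.12096.pdf).

Benchmark Evaluation

See unilm/e5 to reproduce this model's evaluation results on the BEIR and MTEB benchmarks.

Note that E5-RoPE-Base was not trained specifically to maximize performance. Its purpose is to compare embedding models that use absolute position embedding (APE) with those that use rotary position embedding (RoPE). By comparing E5-Base with E5-RoPE-Base, we demonstrate the advantage of RoPE-based embedding models on long contexts. For more details, please refer to our paper LongEmbed: Extending Embedding Models for Long Context Retrieval.

📄 License

This project is released under the MIT License.

📖 Citation

If you find our paper or models helpful, please cite them as follows: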

@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}