mxbai-rerank-large-v2-seq: an open-source sentence transformer for multilingual text ranking

Mxbai Rerank Large V2 Seq

Developed by michaelfeil

A multilingual sentence transformer model for text ranking tasks.

Large language model

Transformers

Multilingual · License: Apache-2.0 · #multilingual-text-ranking #sentence-transformers #cross-lingual-retrieval

Downloads: 210

Released: 3/14/2025

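Model overview

This model is a multilingual sentence transformer for text ranking. It supports 14 languages, including English, Chinese, German, and Japanese, and is suited to cross-lingual text processing and information retrieval.

Model features

Multilingual support

Processes text in 14 languages, including major European and Asian languages.

Text ranking

Optimized specifically for text ranking: it compares sentences and orders them by relevance.

Transformer-based

Uses the Transformer architecture to produce high-quality text representations.

Model capabilities

Multilingual text processing

Sentence embedding generation

Text similarity computation

Cross-lingual information retrieval

Use cases

Information retrieval

Cross-lingual document search

Find relevant documents in a multilingual document collection.

Matches similar content across languages.

Recommender systems

Multilingual content recommendation

Recommend relevant multilingual content based on a user's history.

Improves the cross-lingual user experience.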

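🚀 Mixedbread reranker rewritten as a classifier

This repository rewrites the Mixedbread reranker as a sequence classifier. Its author describes it as the most powerful reranker available as of March 2025, for example for retrieval-augmented generation (RAG).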
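In a RAG pipeline, a reranker usually sits between a cheap first-stage retriever and the generator. The sketch below illustrates that two-stage pattern under stated assumptions: `lexical_overlap`, `retrieve_then_rerank`, and the `rerank_fn` parameter are hypothetical names, and `rerank_fn` stands in for whatever call returns one relevance score per candidate from this model.

```python
from typing import Callable, List, Sequence, Tuple


def lexical_overlap(query: str, document: str) -> float:
    """Cheap first-stage score: fraction of query words that occur in the document."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    rerank_fn: Callable[[str, List[str]], Sequence[float]],  # hypothetical scorer
    first_stage_k: int = 10,
    final_k: int = 3,
) -> List[Tuple[str, float]]:
    """Two-stage retrieval: cheap filter first, expensive reranker second."""
    # Stage 1: keep the first_stage_k documents with the highest cheap score.
    candidates = sorted(
        corpus, key=lambda doc: lexical_overlap(query, doc), reverse=True
    )[:first_stage_k]

    # Stage 2: score only the surviving candidates with the reranker.
    scores = list(rerank_fn(query, candidates))
    if len(scores) != len(candidates):
        raise ValueError("rerank_fn must return one score per candidate")

    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:final_k]
```

The first stage keeps the number of expensive reranker calls bounded by `first_stage_k`, whatever the corpus size; only the final `final_k` passages would be passed on to the generator.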

🚀 Quick Start

FP8 deployment on NVIDIA L4/H100

Example configuration file for the deployment:

build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    input: 'ERROR: This redirects to the embedding endpoint. Use the /sync API to
      reach /sync/predict'
model_name: BEI-mixedbread-ai-mxbai-rerank-base-v2-reranker-fp8-truss-example
python_version: py39
requirements: []
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/mxbai-rerank-large-v2-seq
      revision: main
      source: HF
    max_num_tokens: 32768
    max_seq_len: 1000001
    num_builder_gpus: 4
    quantization_type: fp8

To push the deployment to Baseten.co:

pip install truss --upgrade
nano config.yaml # edit the configuration file shown above
truss push --publish

For more information, see: https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/BEI-mixedbread-ai-mxbai-rerank-base-v2-reranker-fp8
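Once the deployment is live, requests go to the `/sync/predict` route mentioned in the configuration above. The following is a minimal sketch of building such a request without sending it. The URL pattern, the `Api-Key` authorization scheme, and the payload fields `inputs`, `raw_scores`, and `truncate` are assumptions that should be checked against the linked truss example; `build_predict_request` and the model ID are hypothetical.

```python
import json
import urllib.request
from typing import List


def build_predict_request(
    model_id: str, api_key: str, texts: List[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the classification route."""
    # Assumed URL pattern; verify it against your deployment's dashboard.
    url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/predict"
    payload = {
        # Assumed schema: one inner list per text to classify.
        "inputs": [[text] for text in texts],
        "raw_scores": True,  # assumed flag: return unnormalized scores
        "truncate": True,    # assumed flag: truncate over-long inputs
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (not executed here): urllib.request.urlopen(request) would send it.
request = build_predict_request("abcd1234", "YOUR_API_KEY", ["example prompt"])
```

Each text sent this way would be a fully templated query-document prompt rather than a raw query.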

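Use as a classifier

To use the model on Baseten.co or with github.com/michaelfeil/infinity, call the classification API. The model-specific prompt template below must be created manually; it follows the reference implementation at https://github.com/mixedbread-ai/mxbai-rerank/tree/main.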

def create_mxbai_v2_reranker_prompt_template(query: str, document: str, instruction: str = "") -> str:
    """
    Create a carefully formatted chat template string (without tokenizer) for ranking relevance.

    Parameters:
        query (str): The search query.
        document (str): The document text to evaluate.
        instruction (str): Special instructions (e.g., "You are an expert for Mockingbirds.")

    Returns:
        str: The formatted chat template.
    """
    instruction = f"instruction: {instruction}\n" if instruction else ""
    # fixed system prompt, keep as is.
    system_prompt = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
    assert "\n" not in system_prompt
    # the instruction must be a single line; only the trailing newline added above is allowed.
    assert "\n" not in instruction[:-1]
    assert isinstance(query, str)
    assert isinstance(document, str)
    templated = (
        # keep spacing, newlines as is.
        # template for mixedbread reranker v2
        # https://huggingface.co/michaelfeil/mxbai-rerank-base-v2-seq/
        f"<|endoftext|><|im_start|>system\n{system_prompt}\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"{instruction}"
        f"query: {query} \n"
        f"document: {document} \n"
        "You are a search relevance expert who evaluates how well documents match search queries. "
        "For each query-document pair, carefully analyze the semantic relationship between them, then provide your binary relevance judgment (0 for not relevant, 1 for relevant).\n"
        "Relevance:<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    return templated
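The template formats one query-document pair at a time, so reranking a candidate list means building one prompt per document and then sorting by the returned scores. A minimal sketch, assuming the classification endpoint returns one score per prompt in input order; `build_classifier_inputs`, `rank_by_scores`, and the example scores are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def build_classifier_inputs(
    query: str,
    documents: Sequence[str],
    template_fn: Callable[[str, str, str], str],
    instruction: str = "",
) -> List[str]:
    """Apply the prompt template to every (query, document) pair."""
    return [template_fn(query, document, instruction) for document in documents]


def rank_by_scores(
    documents: Sequence[str],
    scores: Sequence[float],
    top_k: Optional[int] = None,
) -> List[Tuple[int, str, float]]:
    """Return (original_index, document, score) tuples, highest score first."""
    if len(documents) != len(scores):
        raise ValueError("expected exactly one score per document")
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [(i, documents[i], scores[i]) for i in order]


# Stand-in template so the sketch runs on its own; in practice pass
# create_mxbai_v2_reranker_prompt_template instead.
def demo_template(query: str, document: str, instruction: str) -> str:
    return f"{instruction}query: {query}\ndocument: {document}"


docs = ["Harper Lee wrote it.", "Herman Melville wrote Moby-Dick."]
prompts = build_classifier_inputs("Who wrote 'To Kill a Mockingbird'?", docs, demo_template)
fake_scores = [0.93, 0.04]  # made-up values standing in for the endpoint's response
print(rank_by_scores(docs, fake_scores, top_k=1))
```

Keeping the original index in each result makes it straightforward to map a ranked entry back to metadata stored alongside the source documents.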

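✨ Key Features

State-of-the-art performance with strong efficiency.
Multilingual support: 100+ languages, with particularly strong English and Chinese performance.
Code support: handles code-related inputs.
Long-context support: handles long inputs.

📦 Installation

Install mxbai-rerank:

pip install mxbai-rerank

💻 Usage Examples

Basic usage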

from mxbai_rerank import MxbaiRerankV2

model = MxbaiRerankV2("mixedbread-ai/mxbai-rerank-large-v2")

query = "Who wrote 'To Kill a Mockingbird'?"
documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
    "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
    "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
    "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
    "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
    "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]

# Rank the documents and keep the three highest-scoring ones
results = model.rank(query, documents, return_documents=True, top_k=3)

print(results)

📚 Documentation

Performance

Benchmark results

Model	BEIR Avg	Multilingual	Chinese	Code Search	Latency (s)
mxbai-rerank-large-v2	57.49	29.79	84.16	32.05	0.89
mxbai-rerank-base-v2	55.57	28.56	83.70	31.73	0.67
mxbai-rerank-large-v1	49.32	21.88	72.53	30.72	2.24

*Latency measured on an A100 GPU.
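To make the gap between model generations concrete, the figures in the table can be turned into a relative quality gain and a latency speedup. The short calculation below uses only the BEIR and latency columns from the table above; the helper names `relative_gain_percent` and `speedup` are illustrative.

```python
# BEIR average and latency (seconds) copied from the benchmark table.
benchmarks = {
    "mxbai-rerank-large-v2": {"beir": 57.49, "latency_s": 0.89},
    "mxbai-rerank-base-v2": {"beir": 55.57, "latency_s": 0.67},
    "mxbai-rerank-large-v1": {"beir": 49.32, "latency_s": 2.24},
}


def relative_gain_percent(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def speedup(old_latency: float, new_latency: float) -> float:
    """How many times faster the new model is than the old one."""
    return old_latency / new_latency


v1 = benchmarks["mxbai-rerank-large-v1"]
v2 = benchmarks["mxbai-rerank-large-v2"]

print(f"BEIR gain: {relative_gain_percent(v2['beir'], v1['beir']):.1f}%")  # 16.6%
print(f"Speedup:   {speedup(v1['latency_s'], v2['latency_s']):.2f}x")      # 2.52x
```

By these numbers, large-v2 scores about 16.6% higher on BEIR than large-v1 while running roughly 2.5 times faster.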

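Training details

The models were trained with a three-step process:

GRPO (Guided Reinforcement Prompt Optimization)
Contrastive learning
Preference learning

For more details, see our technical blog post. A paper is coming soon.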
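The source names the training stages without giving their loss functions. As a generic illustration of the contrastive-learning idea only, and not the authors' training code, an InfoNCE-style loss over one relevant document and several irrelevant ones can be sketched as follows; `info_nce_loss` and `temperature` are illustrative names.

```python
import math
from typing import Sequence


def info_nce_loss(
    positive_score: float,
    negative_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """InfoNCE loss: negative log-probability of the positive under a softmax.

    The scores are the model's relevance scores for one query against one
    relevant document (positive) and several irrelevant ones (negatives).
    """
    logits = [positive_score / temperature]
    logits += [score / temperature for score in negative_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# The loss shrinks as the positive is scored further above the negatives.
print(info_nce_loss(0.0, [0.0, 0.0, 0.0]))  # equals log(4): a uniform guess over 4 documents
print(info_nce_loss(5.0, [0.0, 0.0, 0.0]))
```

Minimizing such a loss pushes the score of the relevant document above those of the irrelevant ones, which is the ordering a reranker needs at inference time.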

📄 License

This model is released under the Apache-2.0 license.

Citation

@online{rerank2025mxbai,
  title={Every Byte Matters: Introducing mxbai-embed-xsmall-v1},
  author={Sean Lee and Aamir Shakir and Julius Lipp and Rui Huang},
  year={2025},
  url={https://www.mixedbread.com/blog/mxbai-rerank-v2},
}