LdIR-Qwen2-reranker-1.5B open-source model: efficient reranking for Chinese medical Q&A and general text

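Ldir Qwen2 Reranker 1.5B

Developed by neofung

A downstream-task model built on Qwen2-1.5B that focuses on reranking. It performs well on Chinese medical Q&A reranking and on general text reranking.

Text Embedding

Transformers

Multilingual support

License: Apache-2.0

#Chinese Q&A reranking #Medical information retrieval #1.5B parameters

Downloads: 51

Released: 8/13/2024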

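Model overview

This is a reranking model built on Qwen2-1.5B. Its main purpose is to improve relevance ranking in retrieval systems, and it is specifically tuned for Chinese medical Q&A.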

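Model features

Tuned for Chinese medical Q&A

Performs strongly on the CMedQA medical Q&A datasets, with MAP above 86.5 (86.50 on CMedQAv1 and 87.11 on CMedQAv2)

Multi-task coverage

Supports several reranking tasks, covering both general text and the medical domain

Efficient inference

Supports FP16 acceleration and multi-GPU parallel inference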

Model capabilities

Text relevance reranking

Medical Q&A reranking

Cross-lingual reranking

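Use cases

Information retrieval

Medical Q&A systems

Improves the ranking quality of candidate answers in medical Q&A systems

MRR of 88.91 on the CMedQAv1 dataset

Search engine optimization

Improves the relevance ranking of search engine results

MAP of 39.35 on the MMarco dataset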

🚀 LdIR-Qwen2-reranker-1.5B

This model is a downstream-task model built on Qwen/Qwen2-1.5B. It follows the approach of the FlagEmbedding reranker and uses Qwen2-1.5B as the pretrained base model.

🚀 Quick start

Dependencies

transformers==4.41.2
flash-attn==2.5.7
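Because the two dependencies are pinned to exact versions, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` and the `PINS` dictionary are illustrative and not part of the original project.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINS = {"transformers": "4.41.2", "flash-attn": "2.5.7"}


def find_mismatches(pins):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINS).items():
        print(f"{name}: expected {PINS[name]}, found {installed}")
```

An empty result means both pins are satisfied. Other versions may also work, but only the pinned ones are stated by the model card.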

Usage

from typing import Dict, List, Tuple, Union

import torch
from tqdm import tqdm
import transformers
from transformers import DataCollatorWithPadding, PreTrainedModel, PreTrainedTokenizer
from transformers.models.qwen2 import Qwen2Config, Qwen2ForSequenceClassification

def preprocess(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    max_len: int = 1024,
) -> Dict:

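    # Join each (query, passage) pair with a blank line, wrap it in the chat
    # template as a single user message, and tokenize the result.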
    input_ids, attention_masks = [], []
    for i, source in enumerate(sources):
        messages = [
            {"role": "user",
            "content": "\n\n".join(source)}
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = tokenizer([text])
        input_id = model_inputs['input_ids'][0]
        attention_mask = model_inputs['attention_mask'][0]
        if len(input_id) > max_len:
            # Truncate the end of the content but keep the last five template tokens:
            # <|im_end|>(151645), \n(198), <|im_start|>(151644), assistant(77091), \n(198)
            diff = len(input_id) - max_len
            input_id = input_id[:-5-diff] + input_id[-5:]
            attention_mask = attention_mask[:-5-diff] + attention_mask[-5:]
            assert len(input_id) == max_len
        input_ids.append(input_id)
        attention_masks.append(attention_mask)

    return dict(
        input_ids=input_ids,
        attention_mask=attention_masks
    )

class FlagRerankerCustom:
    def __init__(
            self,
            model: PreTrainedModel,
            tokenizer: PreTrainedTokenizer,
            use_fp16: bool = False
    ) -> None:
        self.tokenizer = tokenizer
        self.model = model
        self.data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

        if torch.cuda.is_available():
            self.device = torch.device('cuda')
        elif torch.backends.mps.is_available():
            self.device = torch.device('mps')
        else:
            self.device = torch.device('cpu')
            use_fp16 = False
        if use_fp16:
            self.model.half()

        self.model = self.model.to(self.device)

        self.model.eval()

        self.num_gpus = torch.cuda.device_count()
        if self.num_gpus > 1:
            print(f"----------using {self.num_gpus}*GPUs----------")
            self.model = torch.nn.DataParallel(self.model)

    @torch.no_grad()
    def compute_score(self, sentence_pairs: Union[List[Tuple[str, str]], Tuple[str, str]], batch_size: int = 64,
                      max_length: int = 1024) -> List[float]:
        
        if self.num_gpus > 0:
            batch_size = batch_size * self.num_gpus

        # Accept a single (query, passage) pair, given as a list or a tuple,
        # as well as a list of pairs.
        if isinstance(sentence_pairs[0], str):
            sentence_pairs = [sentence_pairs]

        all_scores = []
        for start_index in tqdm(range(0, len(sentence_pairs), batch_size), desc="Compute Scores",
                                disable=True):
            sentences_batch = sentence_pairs[start_index:start_index + batch_size]
            inputs = preprocess(sources=sentences_batch, tokenizer=self.tokenizer, max_len=max_length)
            inputs = [dict(zip(inputs, t)) for t in zip(*inputs.values())]
            inputs = self.data_collator(inputs).to(self.device)
            scores = self.model(**inputs, return_dict=True).logits
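            # Flatten (batch, 1) logits to (batch,). Unlike squeeze(), this keeps a
            # 1-D tensor for a batch of one, so tolist() still returns a list.
            scores = scores.reshape(-1)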
            all_scores.extend(scores.detach().to(torch.float).cpu().numpy().tolist())

        if len(all_scores) == 1:
            return all_scores[0]
        return all_scores

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "neofung/LdIR-Qwen2-reranker-1.5B",
    padding_side="right",
)

config = Qwen2Config.from_pretrained(
    "neofung/LdIR-Qwen2-reranker-1.5B",
    trust_remote_code=True,
    bf16=True,
)

model = Qwen2ForSequenceClassification.from_pretrained(
    "neofung/LdIR-Qwen2-reranker-1.5B",
    config = config,
    trust_remote_code = True,
)

model = FlagRerankerCustom(model=model, tokenizer=tokenizer, use_fp16=False)

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

model.compute_score(pairs)

# [-2.655318021774292, 11.7670316696167]
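In a retrieval pipeline the scores are normally used to sort a candidate list for one query. The sketch below shows that step under stated assumptions: `rerank` and `toy_score` are hypothetical helpers, and `toy_score` is a word-overlap stand-in so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[List[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the top_k passages, best first."""
    if not passages:
        return []
    pairs = [[query, passage] for passage in passages]
    scores = score_fn(pairs)
    # compute_score above returns a bare float when only one pair is scored;
    # normalize that case to a list.
    if isinstance(scores, float):
        scores = [scores]
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def toy_score(pairs):
    """Stand-in scorer (NOT the model): counts words shared by query and passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]
```

With the real model, the call would be `rerank("what is panda?", passages, model.compute_score)`, where `model` is the `FlagRerankerCustom` instance created above. The raw scores are unbounded logits, so only their relative order is meaningful here.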

Evaluation on C-MTEB

from C_MTEB.tasks import *
from mteb import MTEB

save_name = "LdIR-Qwen2-reranker-1.5B"

evaluation = MTEB(
    task_types=["Reranking"], task_langs=["zh", "zh2en", "en2zh"]
)

evaluation.run(model, output_folder=f"reranker_results/{save_name}")

📊 Evaluation results

Task type	Dataset	Metric	Value
Reranking	MTEB CMedQAv1	MAP	86.50438688414654
Reranking	MTEB CMedQAv1	MRR	88.91170634920635
Reranking	MTEB CMedQAv2	MAP	87.10592353383732
Reranking	MTEB CMedQAv2	MRR	89.10178571428571
Reranking	MTEB MMarcoReranking	MAP	39.354813242907133
Reranking	MTEB MMarcoReranking	MRR	39.075793650793655
Reranking	MTEB T2Reranking	MAP	68.83696915006163
Reranking	MTEB T2Reranking	MRR	79.77644651857584
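For readers unfamiliar with the two metrics, here is a minimal sketch of the textbook definitions of MAP and MRR for binary relevance labels. It is not the MTEB evaluator, whose details (cutoffs, tie handling) may differ, and the function names are illustrative. The values in the table appear to be on a 0 to 100 scale, so they correspond to these quantities multiplied by 100.

```python
from typing import Callable, Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """AP for one query; relevance[i] is 1 if the i-th ranked item is relevant."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_metric(
    metric: Callable[[Sequence[int]], float],
    rankings: Sequence[Sequence[int]],
) -> float:
    """Average a per-query metric over all queries (gives MAP or MRR)."""
    return sum(metric(r) for r in rankings) / len(rankings)


# Two queries, each with candidates already sorted by reranker score.
rankings = [[1, 0, 1], [0, 1, 0]]
map_score = mean_metric(average_precision, rankings)  # (5/6 + 1/2) / 2 = 2/3
mrr_score = mean_metric(reciprocal_rank, rankings)    # (1 + 1/2) / 2 = 0.75
```

MRR only rewards the position of the first relevant answer, while MAP accounts for every relevant answer in the list, which is why the two can diverge on datasets with several relevant passages per query.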