license: apache-2.0
language:
- en
tags:
- retrieval
- instructions
- reranking
- mteb
datasets:
- jhu-clsp/FollowIR-train
model-index:
- name: FollowIR-7B
  results:
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/core17-instructions
      name: MTEB Core17InstructionRetrieval
      config: default
      split: test
      revision: e39ff896cf3efbbdeeb950e6bd7c79f266995b07
    metrics:
    - type: p-MRR
      value: 16.47851858684521
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/news21-instructions
      name: MTEB News21InstructionRetrieval
      config: default
      split: test
      revision: e0144086b45fe31ac125e9ac1a83b6a409bb6ca6
    metrics:
    - type: p-MRR
      value: 6.2615989256510005
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/robust04-instructions
      name: MTEB Robust04InstructionRetrieval
      config: default
      split: test
      revision: a5a1c4fe2bc528ac12e83f8cdf82178da85d2f1d
    metrics:
    - type: p-MRR
      value: 13.717553757582253
Model Summary
FollowIR-7B is an instruction-tuned language model for reranking in retrieval. It is Mistral-7B-Instruct-v0.2, fine-tuned on retrieval data paired with human-written instructions from the FollowIR dataset; the instructions are taken from TREC tracks. FollowIR-7B outperforms all other retrieval models at following instructions. See the paper for more details.
Usage
Below is an example of computing similarity scores for query-document pairs:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
import torch

model_name = "jhu-clsp/FollowIR-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_name
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_name, padding_side="left"
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Token ids of the two one-word answers the model is asked to produce
token_false_id = tokenizer.get_vocab()["false"]
token_true_id = tokenizer.get_vocab()["true"]

template = """<s> [INST] You are an expert Google searcher, whose job is to determine if the following document is relevant to the query (true/false). Answer using only one word, one of those two choices.
Query: {query}
Document: {text}
Relevant (only output one word, either "true" or "false"): [/INST] """

query1 = "What movies were written by James Cameron? A relevant document would describe a movie that was written by James Cameron only and not with anyone else"
query2 = "What movies were directed by James Cameron? A relevant document would describe any movie that was directed by James Cameron"
passages = ["Avatar: The Way of Water is a 2022 American epic science fiction film co-produced and directed by James Cameron, who co-wrote the screenplay with Rick Jaffa and Amanda Silver from a story the trio wrote with Josh Friedman and Shane Salerno. Distributed by 20th Century Studios, it is the sequel to Avatar (2009) and the second installment in the Avatar film series."] * 2

prompts = [
    template.format(query=query, text=text) for (query, text) in zip([query1, query2], passages)
]
tokens = tokenizer(
    prompts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    pad_to_multiple_of=None,
)

# Move the batch to the GPU
for key in tokens:
    tokens[key] = tokens[key].cuda()

# Logits at the last position predict the next token ("true" or "false")
batch_scores = model(**tokens).logits[:, -1, :]
true_vector = batch_scores[:, token_true_id]
false_vector = batch_scores[:, token_false_id]
batch_scores = torch.stack([false_vector, true_vector], dim=1)
batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)

# Probability of answering "true", used as the relevance score
scores = batch_scores[:, 1].exp().tolist()
print(scores)
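
For the two prompts above, the score should come out low for query1 (the screenplay was co-written with others) and high for query2 (Cameron directed the film). To rerank a whole candidate list, the same scoring logic can be wrapped in a helper and the passages sorted by score. Below is a minimal sketch reusing the model, tokenizer, template, and token ids defined above; the rerank helper itself is illustrative and not part of the released code:

def rerank(query, passages, batch_size=8):
    # Score each passage against the query and return (passage, score)
    # pairs sorted by P("true"), most relevant first.
    scores = []
    for start in range(0, len(passages), batch_size):
        batch = passages[start:start + batch_size]
        prompts = [template.format(query=query, text=p) for p in batch]
        tokens = tokenizer(
            prompts, padding=True, truncation=True, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            logits = model(**tokens).logits[:, -1, :]
        pair = torch.stack(
            [logits[:, token_false_id], logits[:, token_true_id]], dim=1
        )
        scores.extend(torch.softmax(pair, dim=1)[:, 1].tolist())
    return sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)

for passage, score in rerank(query2, passages):
    print(f"{score:.3f}  {passage[:60]}...")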
Training
We used LLaMA-Factory to fine-tune Mistral into FollowIR-7B. After converting the data to fit its format (the input is the query plus the instruction embedded in the template, the output is the relevance label, and the instruction forms the beginning of the template; see the sketch of a converted record after the script), we used the following training script:
#!/bin/bash
accelerate launch src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
    --dataset followIR-train \
    --template mistral \
    --output_dir OUTPUT \
    --finetuning_type lora \
    --lora_target q_proj,v_proj,o_proj,k_proj \
    --overwrite_cache \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_steps 29 \
    --learning_rate 3e-5 \
    --num_train_epochs 8.0 \
    --plot_loss \
    --max_length 2048 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --bf16
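
For reference, here is a sketch of what one converted training record might look like, assuming LLaMA-Factory's alpaca-style JSON format with instruction/input/output fields. The field contents below are illustrative, not copied from FollowIR-train, and the name passed to --dataset must be registered in LLaMA-Factory's data/dataset_info.json:

import json

# Illustrative record only: the instruction is placed at the start of the
# template, the input carries the query + document, and the output is the
# relevance label the model is trained to emit.
record = {
    "instruction": (
        "You are an expert Google searcher, whose job is to determine if "
        "the following document is relevant to the query (true/false)."
    ),
    "input": (
        "Query: What movies were directed by James Cameron? "
        "A relevant document would describe any movie that was directed "
        "by James Cameron\n"
        "Document: Avatar: The Way of Water is a 2022 American epic "
        "science fiction film co-produced and directed by James Cameron..."
    ),
    "output": "true",
}

# Write the converted examples where the training script can find them.
with open("data/followIR-train.json", "w") as f:
    json.dump([record], f, indent=2)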
Citation
@misc{weller2024followir,
      title={FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions},
      author={Orion Weller and Benjamin Chang and Sean MacAvaney and Kyle Lo and Arman Cohan and Benjamin Van Durme and Dawn Lawrie and Luca Soldaini},
      year={2024},
      eprint={2403.15246},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}