license: apache-2.0
language:
- en
tags:
- retrieval
- instructions
- reranking
- mteb
datasets:
- jhu-clsp/FollowIR-train
model-index:
- name: FollowIR-7B
  results:
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/core17-instructions
      name: MTEB Core17InstructionRetrieval
      config: default
      split: test
      revision: e39ff896cf3efbbdeeb950e6bd7c79f266995b07
    metrics:
    - type: p-MRR
      value: 16.47851858684521
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/news21-instructions
      name: MTEB News21InstructionRetrieval
      config: default
      split: test
      revision: e0144086b45fe31ac125e9ac1a83b6a409bb6ca6
    metrics:
    - type: p-MRR
      value: 6.2615989256510005
  - task:
      type: InstructionRetrieval
    dataset:
      type: jhu-clsp/robust04-instructions
      name: MTEB Robust04InstructionRetrieval
      config: default
      split: test
      revision: a5a1c4fe2bc528ac12e83f8cdf82178da85d2f1d
    metrics:
    - type: p-MRR
      value: 13.717553757582253
Model Summary
FollowIR-7B is an instruction-tuned language model for reranking in retrieval. It is Mistral-7B-Instruct-v0.2, fine-tuned on retrieval data paired with human-written instructions from the FollowIR dataset; the instructions are taken from TREC tracks. FollowIR-7B outperforms all other retrieval models at following instructions. See the paper for more details.
Usage
Below is an example of computing similarity scores for query-document pairs:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
import torch

model_name = "jhu-clsp/FollowIR-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_name
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_name, padding_side="left"
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Token ids of the two one-word answers the model is asked to produce
token_false_id = tokenizer.get_vocab()["false"]
token_true_id = tokenizer.get_vocab()["true"]

template = """<s> [INST] You are an expert Google searcher, whose job is to determine if the following document is relevant to the query (true/false). Answer using only one word, one of those two choices.
Query: {query}
Document: {text}
Relevant (only output one word, either "true" or "false"): [/INST] """

query1 = "What movies were written by James Cameron? A relevant document would describe a movie that was written by James Cameron only and not with anyone else"
query2 = "What movies were directed by James Cameron? A relevant document would describe any movie that was directed by James Cameron"
passages = ["Avatar: The Way of Water is a 2022 American epic science fiction film co-produced and directed by James Cameron, who co-wrote the screenplay with Rick Jaffa and Amanda Silver from a story the trio wrote with Josh Friedman and Shane Salerno. Distributed by 20th Century Studios, it is the sequel to Avatar (2009) and the second installment in the Avatar film series."] * 2

prompts = [
    template.format(query=query, text=text) for (query, text) in zip([query1, query2], passages)
]
tokens = tokenizer(
    prompts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    pad_to_multiple_of=None,
)

# Move the batch to the GPU
for key in tokens:
    tokens[key] = tokens[key].cuda()

# Logits at the last position predict the next token ("true" or "false")
batch_scores = model(**tokens).logits[:, -1, :]
true_vector = batch_scores[:, token_true_id]
false_vector = batch_scores[:, token_false_id]
batch_scores = torch.stack([false_vector, true_vector], dim=1)
batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)

# Probability of answering "true", used as the relevance score
scores = batch_scores[:, 1].exp().tolist()
print(scores)
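
For the two prompts above, the score should come out low for query1 (the screenplay was co-written with others) and high for query2 (Cameron directed the film). To rerank a whole candidate list, the same scoring logic can be wrapped in a helper and the passages sorted by score. Below is a minimal sketch reusing the model, tokenizer, template, and token ids defined above; the rerank helper itself is illustrative and not part of the released code:

def rerank(query, passages, batch_size=8):
    # Score each passage against the query and return (passage, score)
    # pairs sorted by P("true"), most relevant first.
    scores = []
    for start in range(0, len(passages), batch_size):
        batch = passages[start:start + batch_size]
        prompts = [template.format(query=query, text=p) for p in batch]
        tokens = tokenizer(
            prompts, padding=True, truncation=True, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            logits = model(**tokens).logits[:, -1, :]
        pair = torch.stack(
            [logits[:, token_false_id], logits[:, token_true_id]], dim=1
        )
        scores.extend(torch.softmax(pair, dim=1)[:, 1].tolist())
    return sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)

for passage, score in rerank(query2, passages):
    print(f"{score:.3f}  {passage[:60]}...")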
Training
We used LLaMA-Factory to fine-tune Mistral into FollowIR-7B. After converting the data to fit its format (the input is the query plus the instruction embedded in the template, the output is the relevance label, and the instruction forms the beginning of the template; see the sketch of a converted record after the script), we used the following training script:
#!/bin/bash
accelerate launch src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
    --dataset followIR-train \
    --template mistral \
    --output_dir OUTPUT \
    --finetuning_type lora \
    --lora_target q_proj,v_proj,o_proj,k_proj \
    --overwrite_cache \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_steps 29 \
    --learning_rate 3e-5 \
    --num_train_epochs 8.0 \
    --plot_loss \
    --max_length 2048 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --bf16
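
For reference, here is a sketch of what one converted training record might look like, assuming LLaMA-Factory's alpaca-style JSON format with instruction/input/output fields. The field contents below are illustrative, not copied from FollowIR-train, and the name passed to --dataset must be registered in LLaMA-Factory's data/dataset_info.json:

import json

# Illustrative record only: the instruction is placed at the start of the
# template, the input carries the query + document, and the output is the
# relevance label the model is trained to emit.
record = {
    "instruction": (
        "You are an expert Google searcher, whose job is to determine if "
        "the following document is relevant to the query (true/false)."
    ),
    "input": (
        "Query: What movies were directed by James Cameron? "
        "A relevant document would describe any movie that was directed "
        "by James Cameron\n"
        "Document: Avatar: The Way of Water is a 2022 American epic "
        "science fiction film co-produced and directed by James Cameron..."
    ),
    "output": "true",
}

# Write the converted examples where the training script can find them.
with open("data/followIR-train.json", "w") as f:
    json.dump([record], f, indent=2)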
Citation
@misc{weller2024followir,
      title={FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions},
      author={Orion Weller and Benjamin Chang and Sean MacAvaney and Kyle Lo and Arman Cohan and Benjamin Van Durme and Dawn Lawrie and Luca Soldaini},
      year={2024},
      eprint={2403.15246},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}