---
language:
- en
thumbnail: "https://pbs.twimg.com/media/FThx_rEWAAEoujW?format=jpg&name=medium"
tags:
- t5
- contrastive learning
- ranking
- decoding
- metric learning
- pytorch
- text generation
- retrieval
license: "apache-2.0"
datasets:
- Wikipedia
- PG19
- C4
- relic
- ChapterBreak
- HellaSwag
- ROCStories
metrics:
- MAUVE
- human
---
## Main repository

https://github.com/martiansideofthemoon/rankgen

## What is RankGen?

RankGen is a suite of encoder models (100M-1.2B parameters) which map prefixes and generations from any pretrained English language model to a shared vector space. RankGen can be used to rerank multiple full-length samples from a language model, and it can also be incorporated as a scoring function into beam search to significantly improve generation quality (0.85 vs 0.77 MAUVE, 75% preference in human evaluations by English writers). RankGen can also be used as a dense retriever, and achieves state-of-the-art performance on literary retrieval.
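To make the reranking idea concrete: each candidate continuation is scored by the inner product between the prefix embedding and the candidate's suffix embedding, and the highest-scoring candidate is kept. The sketch below is illustrative only, with random vectors standing in for real RankGen embeddings:

```python
import torch

# Illustrative RankGen-style reranking: random vectors stand in for
# real prefix/suffix embeddings produced by the encoder.
torch.manual_seed(0)

prefix_vector = torch.randn(768)       # embedding of one prefix
suffix_vectors = torch.randn(10, 768)  # embeddings of ten candidate continuations

scores = suffix_vectors @ prefix_vector    # inner-product similarity per candidate
best_candidate = int(torch.argmax(scores)) # index of the preferred continuation
print("best candidate index:", best_candidate)
```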
## Setup

**Requirements** (`pip` will install these dependencies for you)

Python 3.7+, `torch` (CUDA recommended), `transformers`

**Installation**

```
python3.7 -m virtualenv rankgen-venv
source rankgen-venv/bin/activate
pip install rankgen
```

Get the data via this link and place the folder in the root directory, or use the `gdown` command below:

```
gdown --folder https://drive.google.com/drive/folders/1DRG2ess7fK3apfB-6KoHb_azMuHbsIv4
```

Run the test script to make sure the RankGen checkpoint loads correctly:

```
python -m rankgen.test_rankgen_encoder --model_path kalpeshk2011/rankgen-t5-base-all
```
### Expected output

```
0.0009239262409127233
0.0011521980725477804
```
使用指南
虽然可通过HuggingFace API直接加载(见方法二),但推荐使用RankGenEncoder
封装类,它能自动处理数据预处理和分词。您可下载我们的仓库安装API,或直接复制下方实现代码。
[推荐] 方法一:通过RankGenEncoder加载
from rankgen import RankGenEncoder, RankGenGenerator
rankgen_encoder = RankGenEncoder("kalpeshk2011/rankgen-t5-xl-all")
prefix_vectors = rankgen_encoder.encode(["这是前缀句子"], vectors_type="prefix")
suffix_vectors = rankgen_encoder.encode(["这是后缀句子"], vectors_type="suffix")
generator = RankGenGenerator(rankgen_encoder=rankgen_encoder, language_model="gpt2-medium")
inputs = ["无论悲剧性质如何,此刻早已落幕,远处西边移动的黑点必是撤离的袭击者。基思唯一能做的就是确认遇难者命运并予以安葬。除非队伍中有女性被掳为俘虏,否则生还希望渺茫。"]
print(generator.generate_single(inputs, top_p=0.9)[0][0])
print(generator.overgenerate_rerank(inputs, top_p=0.9, num_samples=10)[0][0])
print(generator.beam_search(inputs, top_p=0.9, num_samples=10, beam_size=2)[0][0])
### Method 2: loading the model with the HuggingFace API

```python
from transformers import T5Tokenizer, AutoModel

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xl")
model = AutoModel.from_pretrained("kalpeshk2011/rankgen-t5-xl-all", trust_remote_code=True)
```
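When loading through the raw HuggingFace API, you must handle the preprocessing that `RankGenEncoder` would otherwise do: prepend a marker to each input (`"pre "` for prefixes, `"suffi "` for suffixes) and truncate to a type-specific maximum length. A minimal standalone sketch of that convention (the `preprocess` helper is our own illustration, not part of the library):

```python
def preprocess(texts, vectors_type="prefix"):
    """Apply RankGen's input markers and return the matching max token length."""
    if vectors_type == "prefix":
        # Prefixes are marked with "pre " and truncated to 512 tokens
        return ["pre " + t for t in texts], 512
    # Suffixes are marked with "suffi " and truncated to 128 tokens
    return ["suffi " + t for t in texts], 128

marked, max_length = preprocess(["An example prefix."], "prefix")
print(marked, max_length)  # ['pre An example prefix.'] 512
```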
### RankGenEncoder implementation

```python
import tqdm
import torch
from transformers import T5Tokenizer, AutoModel


class RankGenEncoder():
    def __init__(self, model_path, max_batch_size=32, model_size=None, cache_dir=None):
        assert model_path in ["kalpeshk2011/rankgen-t5-xl-all", "kalpeshk2011/rankgen-t5-xl-pg19", "kalpeshk2011/rankgen-t5-base-all", "kalpeshk2011/rankgen-t5-large-all"]
        self.max_batch_size = max_batch_size
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        # Infer the T5 tokenizer size from the checkpoint name if not given
        if model_size is None:
            if "t5-large" in model_path or "t5_large" in model_path:
                self.model_size = "large"
            elif "t5-xl" in model_path or "t5_xl" in model_path:
                self.model_size = "xl"
            else:
                self.model_size = "base"
        else:
            self.model_size = model_size

        self.tokenizer = T5Tokenizer.from_pretrained(f"google/t5-v1_1-{self.model_size}", cache_dir=cache_dir)
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
        self.model.to(self.device)
        self.model.eval()

    def encode(self, inputs, vectors_type="prefix", verbose=False, return_input_ids=False):
        tokenizer = self.tokenizer
        max_batch_size = self.max_batch_size
        if isinstance(inputs, str):
            inputs = [inputs]
        # Prepend the marker RankGen expects and set the truncation length
        if vectors_type == 'prefix':
            inputs = ['pre ' + input for input in inputs]
            max_length = 512
        else:
            inputs = ['suffi ' + input for input in inputs]
            max_length = 128
        all_embeddings = []
        all_input_ids = []
        for i in tqdm.tqdm(range(0, len(inputs), max_batch_size), total=(len(inputs) // max_batch_size) + 1, disable=not verbose, desc=f"Encoding {vectors_type} inputs:"):
            tokenized_inputs = tokenizer(inputs[i:i + max_batch_size], return_tensors="pt", padding=True)
            # Truncate each tensor (input_ids, attention_mask) to max_length
            for k, v in tokenized_inputs.items():
                tokenized_inputs[k] = v[:, :max_length]
            tokenized_inputs = tokenized_inputs.to(self.device)
            with torch.inference_mode():
                batch_embeddings = self.model(**tokenized_inputs)
            all_embeddings.append(batch_embeddings)
            if return_input_ids:
                all_input_ids.extend(tokenized_inputs.input_ids.cpu().tolist())
        return {
            "embeddings": torch.cat(all_embeddings, dim=0),
            "input_ids": all_input_ids
        }
```
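The loop in `encode` processes inputs in chunks of `max_batch_size` and concatenates the per-batch embeddings into one tensor. A small standalone sketch of that batch-and-concatenate pattern, with random vectors standing in for the model's outputs:

```python
import torch

torch.manual_seed(0)
inputs = [f"sentence {i}" for i in range(7)]  # 7 inputs
max_batch_size = 3                            # -> batches of 3, 3, 1

all_embeddings = []
for i in range(0, len(inputs), max_batch_size):
    batch = inputs[i:i + max_batch_size]
    # Stand-in for self.model(**tokenized_inputs): one 768-d vector per input
    batch_embeddings = torch.randn(len(batch), 768)
    all_embeddings.append(batch_embeddings)

# Concatenate along the batch dimension, as encode() does on return
embeddings = torch.cat(all_embeddings, dim=0)
print(embeddings.shape)  # torch.Size([7, 768])
```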