LLM2Vec Open-Source Text Encoder: Convert Large Language Models into Text Encoders for Free

LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse

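Developed by McGill-NLP

LLM2Vec is a recipe for converting decoder-only large language models into text encoders. The conversion enables bidirectional attention, trains with masked next token prediction (MNTP), and then applies unsupervised contrastive learning.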

Text Embedding

Safetensors

Language: English | License: MIT | #decoder-to-encoder #unsupervised-contrastive-learning #instruction-aware-embeddings

Downloads: 55

Release date: 10/8/2024

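Model Overview

This model turns a large language model into a text encoder through a three-step conversion recipe. It supports text embedding, information retrieval, and related tasks, and can be fine-tuned further to improve performance.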

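Model Features

Bidirectional attention

Enabling bidirectional attention lets every token attend to the full context, both preceding and following tokens.

Unsupervised contrastive learning

Unsupervised contrastive learning is used to improve the quality of the text representations.

Fine-tuning compatibility

The model can be fine-tuned further to reach state-of-the-art performance.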
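
To make the bidirectional attention feature concrete, the sketch below contrasts the causal mask of a decoder-only model with the all-ones mask used once bidirectional attention is enabled. The function names `causal_mask` and `bidirectional_mask` are illustrative assumptions and are not part of the llm2vec package; the library's actual implementation lives in its custom model code.

```python
def causal_mask(n):
    """Row i lists which positions token i may attend to (1 = allowed).

    In a decoder-only model, token i sees only positions 0..i.
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """With bidirectional attention, every token sees every position."""
    return [[1] * n for _ in range(n)]


if __name__ == "__main__":
    print("causal:")
    for row in causal_mask(4):
        print(row)
    print("bidirectional:")
    for row in bidirectional_mask(4):
        print(row)
```

In the causal case the first token sees only itself, so its representation carries no information about the rest of the sentence, which is why the mask change matters for embeddings.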

Model Capabilities

Text embedding generation

Information retrieval

Semantic textual similarity

Text classification

Text clustering

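Use Cases

Information retrieval

Web search query matching

Match user queries against relevant documents for retrieval.

In the example on this page, the query and its relevant document reach a cosine similarity of about 0.60.

Question answering

Protein intake question

Answer a question about how much protein a woman should eat per day.

In the example on this page, the query scores 0.6007 against the passage citing the CDC guideline and 0.3518 against the unrelated passage.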

🚀 LLM2Vec

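LLM2Vec is a simple recipe for converting decoder-only large language models (LLMs) into text encoders. It consists of three steps: 1) enabling bidirectional attention; 2) masked next token prediction (MNTP); 3) unsupervised contrastive learning. The model can be fine-tuned further to reach state-of-the-art performance.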
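
Step 3 can be illustrated with a generic SimCSE-style contrastive objective: each sentence is embedded twice (for example, in two forward passes with different dropout masks), the two views of the same sentence form a positive pair, and the other sentences in the batch act as negatives. The sketch below is a minimal pure-Python version of that loss, not the LLM2Vec training code; the names `cosine` and `info_nce_loss` and the temperature value of 0.05 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(view_a, view_b, temperature=0.05):
    """Average contrastive loss over a batch.

    view_a[i] and view_b[i] are two embeddings of sentence i (the positive
    pair); view_b[j] for j != i serve as in-batch negatives.
    The temperature default is an assumed, illustrative value.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    view_a = [[1.0, 0.0], [0.0, 1.0]]
    view_b = [[0.9, 0.1], [0.1, 0.9]]
    print("aligned views:   ", info_nce_loss(view_a, view_b))
    print("mismatched views:", info_nce_loss(view_a, view_b[::-1]))
```

The loss is small when each embedding is closest to its own second view and large when the pairs are mismatched, which pushes the encoder toward representations that are stable for the same sentence and distinct across sentences.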

🚀 Quick Start

Repository: https://github.com/McGill-NLP/llm2vec
Paper: https://arxiv.org/abs/2404.05961

📦 Installation

pip install llm2vec

💻 Usage

Basic usage

from llm2vec import LLM2Vec

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel

# Load the base Meta-Llama-3.1-8B-Instruct model, along with custom code that enables bidirectional connections in decoder-only LLMs. MNTP LoRA weights are merged into the base model.
tokenizer = AutoTokenizer.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp"
)
config = AutoConfig.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(
    model,
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
)
model = model.merge_and_unload()  # This can take several minutes on cpu

# Loading unsupervised SimCSE model. This loads the trained LoRA weights on top of MNTP model. Hence the final weights are -- Base model + MNTP (LoRA) + SimCSE (LoRA).
model = PeftModel.from_pretrained(
    model, "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse"
)

# Wrapper for encoding and pooling operations
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

# Encoding queries using instructions
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

print(cos_sim)
"""
tensor([[0.6007, 0.3518],
        [0.4131, 0.4855]])
"""
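
To read the result above: row i of `cos_sim` holds the similarity of query i to each document, so the best match per query is the column with the largest value. The sketch below shows that lookup on the printed values, together with a plain masked mean over token vectors as a rough picture of what mean pooling computes. Both helpers (`mean_pool`, `best_match`) are hypothetical names written for illustration; the library's own pooling may differ in details such as how instruction tokens are treated.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def best_match(similarity_rows):
    """For each query row, return the index of the highest-scoring document."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity_rows]


if __name__ == "__main__":
    # Toy token vectors; the last position is padding and is ignored.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))  # [2.0, 3.0]

    # Similarity values copied from the printed output above.
    cos_sim = [[0.6007, 0.3518], [0.4131, 0.4855]]
    print(best_match(cos_sim))  # [0, 1]
```

Each query ranks its own document first: the protein query matches the CDC passage (0.6007) and "summit define" matches the dictionary definition (0.4855).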