LLM2Vec-Sheared-LLaMA-mntp Open-Source Model - A Tool for Turning Large Language Models into Text Encoders

Llm2vec Sheared LLaMA Mntp

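Developed by McGill-NLP

LLM2Vec is a simple recipe for converting decoder-only large language models into text encoders. It works by enabling bidirectional attention, training with masked next token prediction, and applying unsupervised contrastive learning.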

Text Embedding

Transformers

Language: English

License: MIT

Tags: #decoder-to-encoder #bidirectional-attention #unsupervised-contrastive-learning

Downloads: 2,430

Released: 4/4/2024

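Model Overview

LLM2Vec is a technique for converting large language models into efficient text encoders, suited to tasks such as text similarity computation and information retrieval.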

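Model Features

Bidirectional attention

Enabling bidirectional attention lets every token attend to both its left and right context, rather than only to the tokens that precede it.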
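To make the difference concrete, here is a minimal sketch in PyTorch of attention weights computed with and without a causal mask. It is a toy illustration, not the library's implementation: the helper `attention_weights` and the random 4-token scores are assumptions introduced only for this example.

```python
import torch


def attention_weights(scores: torch.Tensor, causal: bool) -> torch.Tensor:
    """Softmax over raw attention scores, optionally under a causal mask.

    Hypothetical helper for illustration only.
    """
    seq_len = scores.size(-1)
    if causal:
        # Decoder-style masking: position i may only attend to positions <= i.
        allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
        scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1)


torch.manual_seed(0)
scores = torch.randn(4, 4)  # toy scores for a 4-token sequence

causal_w = attention_weights(scores, causal=True)
bidir_w = attention_weights(scores, causal=False)

# Under the causal mask the first token puts all its weight on itself.
print(causal_w[0])
# Without the mask the first token spreads weight over the whole sequence.
print(bidir_w[0])
```

Dropping the mask changes nothing about the weights themselves, which is why the model needs further training to make good use of the newly visible right-hand context.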

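Masked next token prediction

Training with masked next token prediction (MNTP) teaches the model to make use of bidirectional context and improves its text understanding.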
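The objective can be sketched as follows, assuming that the prediction for a masked token at position i is read from the logits at position i-1, which is how the "next token" part of the name is usually understood. The names `mntp_loss`, `mask_id`, and `vocab_size`, and the random logits standing in for a real model, are all hypothetical.

```python
import torch
import torch.nn.functional as F


def mntp_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on masked tokens, scored against the previous position's logits.

    logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with -100 at every position that was not masked.
    Hypothetical helper for illustration only.
    """
    # Shift by one: logits at position i-1 are compared with the token at position i.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = labels[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=-100,  # unmasked positions do not contribute to the loss
    )


torch.manual_seed(0)
vocab_size, mask_id = 10, 0  # assumed toy values

input_ids = torch.tensor([[3, 5, 7, 2, 8]])
masked_positions = torch.tensor([[False, False, True, False, True]])

# Keep the original token as the label only where the input was masked.
labels = torch.where(masked_positions, input_ids, torch.tensor(-100))
# Replace the masked tokens in the model input.
corrupted = input_ids.masked_fill(masked_positions, mask_id)

# Stand-in for the output of a bidirectional model run on `corrupted`.
logits = torch.randn(1, 5, vocab_size)
loss = mntp_loss(logits, labels)
print(loss.item())
```

One consequence of the shift in this sketch is that a mask at position 0 has no preceding logits and is silently skipped.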

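Unsupervised contrastive learning

Unsupervised contrastive learning improves the quality of the embeddings without requiring large amounts of labeled data.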
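A common way to do this without labels is the SimCSE-style setup: encode the same input twice with different dropout masks, treat the two results as a positive pair, and use the other items in the batch as negatives. The sketch below assumes that setup; the tiny `encoder`, the helper `simcse_loss`, and the temperature of 0.05 are illustrative assumptions, not values taken from this model card.

```python
import torch
import torch.nn.functional as F


def simcse_loss(view_a: torch.Tensor, view_b: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of view_a should match row i of view_b.

    Hypothetical helper for illustration only.
    """
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    sim = a @ b.T / temperature           # (batch, batch) scaled cosine similarities
    targets = torch.arange(sim.size(0))   # the positive for row i sits on the diagonal
    return F.cross_entropy(sim, targets)


torch.manual_seed(0)
# Toy stand-in for a text encoder; dropout makes two passes over the same input differ.
encoder = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Dropout(p=0.3))
encoder.train()

batch = torch.randn(8, 16)  # pretend these are 8 pooled sentence features
view_a = encoder(batch)     # first pass, one dropout mask
view_b = encoder(batch)     # second pass, a different dropout mask

loss = simcse_loss(view_a, view_b)
print(loss.item())
```

The loss falls as each embedding moves closer to its own second view and further from the other sentences in the batch.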

Simple conversion recipe

Only three simple steps are needed to turn a decoder-only LLM into an efficient text encoder.

Model Capabilities

Text embedding

Semantic text similarity

Information retrieval

Text classification

Text clustering

Feature extraction

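Use Cases

Information retrieval

Web search query matching

Retrieve relevant passages for a user query

High-accuracy query-document matching

Text analysis

Document similarity analysis

Compute the semantic similarity between different documents

Effective document clustering and classification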

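🚀 LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

LLM2Vec is a simple recipe for converting decoder-only large language models into text encoders. It consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. The model can be further fine-tuned to reach state-of-the-art performance.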

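🚀 Quick Start

LLM2Vec offers a concise and effective way to turn decoder-only large language models into powerful text encoders. In a few simple steps you can use it to encode text and compute similarities.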

📦 Installation

Install llm2vec with the following command:

pip install llm2vec

💻 Usage Examples

Basic usage

from llm2vec import LLM2Vec

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel

# Loading base Sheared-LLaMA model, along with custom code that enables bidirectional connections in decoder-only LLMs.
tokenizer = AutoTokenizer.from_pretrained(
    "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp"
)
config = AutoConfig.from_pretrained(
    "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)

# Loading MNTP (Masked Next Token Prediction) model.
model = PeftModel.from_pretrained(
    model,
    "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
)

# Wrapper for encoding and pooling operations
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

# Encoding queries using instructions
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

print(cos_sim)
"""
tensor([[0.8180, 0.5825],
        [0.1069, 0.1931]])
"""
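
For retrieval, the similarity matrix is typically turned into a ranking of documents per query. The short sketch below does this with the values printed above, hard-coded so that it runs without downloading the model; the variable names `best_doc` and `ranking` are introduced here for illustration.

```python
import torch

# Similarity matrix from the example output: rows are queries, columns are documents.
cos_sim = torch.tensor([[0.8180, 0.5825],
                        [0.1069, 0.1931]])

# Index of the highest-scoring document for each query.
best_doc = cos_sim.argmax(dim=1)
# Full ranking of documents for each query, best first.
ranking = cos_sim.argsort(dim=1, descending=True)

for q, d in enumerate(best_doc.tolist()):
    print(f"query {q} -> document {d} (score {cos_sim[q, d].item():.4f})")
# query 0 -> document 0 (score 0.8180)
# query 1 -> document 1 (score 0.1931)
```

Each query is matched to its own passage here, although the second query's best score is much lower than the first's, so a fixed similarity threshold would treat the two matches very differently.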