🚀 BERT/MPnet基础模型(无大小写区分)
本模型是一个句子转换器模型,它可以将句子和段落映射到768维的密集向量空间,可用于聚类或语义搜索等任务。
🚀 快速开始
📦 安装指南
若已安装句子转换器,使用该模型会非常便捷:
pip install -U sentence-transformers
💻 使用示例
基础用法
使用sentence-transformers
库的示例代码如下:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('abbasgolestani/ag-nli-bert-mpnet-base-uncased-sentence-similarity-v1') nli-mpnet-base-v2
sentences1 = ['I am honored to be given the opportunity to help make our company better',
'I love my job and what I do here',
'I am excited about our company’s vision']
sentences2 = ['I am hopeful about the future of our company',
'My work is aligning with my passion',
'Definitely our company vision will be the next breakthrough to change the world and I’m so happy and proud to work here']
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings1, embeddings2)
for i in range(len(sentences1)):
print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
高级用法
若未安装句子转换器,可按以下方式使用该模型:首先将输入传递给转换器模型,然后对上下文词嵌入应用正确的池化操作。
from transformers import AutoTokenizer, AutoModel
import torch
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentences = ['This is an example sentence', 'Each sentence is converted']
tokenizer = AutoTokenizer.from_pretrained('abbasgolestani/ag-nli-bert-mpnet-base-uncased-sentence-similarity-v1')
model = AutoModel.from_pretrained('abbasgolestani/ag-nli-bert-mpnet-base-uncased-sentence-similarity-v1')
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
📚 详细文档
🔍 评估结果
该模型在包含1000个句子对的本地数据集上进行了评估,此算法在该数据集上的准确率达到了82%。
🔧 技术细节
训练参数
模型使用以下参数进行训练:
- 数据加载器:
torch.utils.data.dataloader.DataLoader
,长度为7,参数如下:{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- 损失函数:
sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
fit()
方法的参数如下:{
"epochs": 1,
"evaluation_steps": 0,
"evaluator": "NoneType",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 100,
"weight_decay": 0.01
}
完整模型架构
SentenceTransformer(
(0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
📄 许可证
文档中未提及许可证相关信息。
📖 引用与作者
文档中未提供更多相关信息。