pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language: ko
kf-deberta-multitask
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. The training recipe is described on GitHub.
Usage (Sentence-Transformers)
Using this model is straightforward once sentence-transformers is installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["你好吗?", "这是一个用于韩语句子嵌入的BERT模型。"]
model = SentenceTransformer("upskyy/kf-deberta-multitask")
embeddings = model.encode(sentences)
print(embeddings)
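Since the model is intended for clustering and semantic search, here is a minimal sketch of scoring sentence pairs with cosine similarity via sentence_transformers.util (the query and corpus sentences are illustrative, not from the card):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("upskyy/kf-deberta-multitask")

# Illustrative query and corpus; any Korean sentences work the same way.
query_embedding = model.encode("How are you?", convert_to_tensor=True)
corpus_embeddings = model.encode(
    ["This is a BERT model for Korean sentence embeddings.", "Hello, nice to meet you."],
    convert_to_tensor=True,
)

# Cosine similarity between the query and each corpus sentence (higher = more similar).
scores = util.cos_sim(query_embedding, corpus_embeddings)
print(scores)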
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, ignoring padded positions via the attention mask.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["How are you?", "This is a BERT model for Korean sentence embeddings."]

tokenizer = AutoTokenizer.from_pretrained("upskyy/kf-deberta-multitask")
model = AutoModel.from_pretrained("upskyy/kf-deberta-multitask")

# Tokenize the sentences and compute token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get one fixed-size sentence vector per input
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("句子嵌入:")
print(sentence_embeddings)
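As a quick sanity check, a minimal sketch (assuming the code above has already run) that compares the two pooled embeddings directly with cosine similarity:
import torch.nn.functional as F

# Normalize the sentence vectors, then take their dot product (= cosine similarity).
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized[0] @ normalized[1])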
Evaluation Results
After multi-task training on the KorSTS and KorNLI training datasets, the model was evaluated on the KorSTS evaluation dataset with the following results (a sketch for reproducing the evaluation follows the list).
- Cosine Pearson: 85.75
- Cosine Spearman: 86.25
- Manhattan Pearson: 84.80
- Manhattan Spearman: 85.27
- Euclidean Pearson: 84.79
- Euclidean Spearman: 85.25
- Dot Pearson: 82.93
- Dot Spearman: 82.86
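The fit() parameters below show that sentence-transformers' EmbeddingSimilarityEvaluator was used; a minimal sketch of running a comparable evaluation yourself, where load_korsts_test is a hypothetical helper (the card does not show how the KorSTS split was read):
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("upskyy/kf-deberta-multitask")

def load_korsts_test():
    # Placeholder pairs; swap in the real KorSTS test split (sentence1, sentence2, score in [0, 5]).
    return [("How are you?", "How are you doing?", 4.0),
            ("It is raining.", "The stock market rose today.", 0.4)]

examples = [InputExample(texts=[s1, s2], label=score / 5.0)  # scores normalized to [0, 1]
            for s1, s2, score in load_korsts_test()]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples, name="korsts-test")
print(evaluator(model))  # reports Pearson/Spearman correlations for several similarity metrics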
Training
The model was trained with the following parameters (a minimal training sketch follows the fit() parameter listing below):
DataLoader:
sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader of length 4442 with parameters:
{'batch_size': 128}
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
DataLoader:
torch.utils.data.dataloader.DataLoader of length 719 with parameters:
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
Parameters of the fit() method:
{
"epochs": 10,
"evaluation_steps": 1000,
"evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 719,
"weight_decay": 0.01
}
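Putting those pieces together, a minimal training sketch under stated assumptions: the base checkpoint name and the placeholder InputExamples are illustrative, since the card lists the dataloaders, losses, and fit() arguments but not the data-loading code:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

# Assumed base checkpoint; the card documents only the fine-tuning setup.
model = SentenceTransformer("kakaobank/kf-deberta-base")

# Objective 1: KorNLI with MultipleNegativesRankingLoss. NoDuplicatesDataLoader keeps
# duplicate sentences out of a batch so they cannot act as false in-batch negatives.
nli_examples = [InputExample(texts=["anchor", "entailment", "contradiction"])]  # placeholder data
nli_dataloader = NoDuplicatesDataLoader(nli_examples, batch_size=128)
nli_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

# Objective 2: KorSTS with CosineSimilarityLoss (similarity labels scaled to [0, 1]).
sts_examples = [InputExample(texts=["sentence a", "sentence b"], label=0.8)]  # placeholder data
sts_dataloader = DataLoader(sts_examples, shuffle=True, batch_size=8)
sts_loss = losses.CosineSimilarityLoss(model)

# Multi-task fit() with the hyperparameters listed above; each training step
# draws one batch from every objective.
model.fit(
    train_objectives=[(nli_dataloader, nli_loss), (sts_dataloader, sts_loss)],
    epochs=10,
    warmup_steps=719,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)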
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DebertaV2Model
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
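The same architecture can be assembled module by module; a minimal sketch mirroring the printed configuration (mean pooling over 768-dimensional token embeddings, max sequence length 128):
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("upskyy/kf-deberta-multitask", max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])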
Citing & Authors
@inproceedings{jeon-etal-2023-kfdeberta,
  title = {KF-DeBERTa: Financial Domain-specific Pre-trained Language Model},
  author = {Eunkwang Jeon, Jungdae Kim, Minsang Song, and Joohyun Ryu},
  booktitle = {Proceedings of the 35th Annual Conference on Human and Cognitive Language Technology},
  month = {oct},
  year = {2023},
  publisher = {Korean Institute of Information Scientists and Engineers},
  url = {http://www.hclt.kr/symp/?lnb=conference},
  pages = {143--148},
}
@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}