---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- ko
widget:
- source_sentence: "Flies are buzzing around in that restaurant"
  sentences:
    - "That restaurant has no customers"
    - "That restaurant is flying drones"
    - "Flies are buzzing around inside the restaurant"
  example_title: "Restaurant example"
- source_sentence: "Drowsiness is setting in"
  sentences:
    - "I am not sleepy at all"
    - "I am starting to feel sleepy"
    - "The train has arrived at the station"
  example_title: "Sleepiness example"
---
# snunlp/KR-SBERT-V40K-klueNLI-augSTS

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
## Usage (Sentence-Transformers)

Using this model is straightforward once sentence-transformers is installed:

```bash
pip install -U sentence-transformers
```
Then use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
embeddings = model.encode(sentences)
print(embeddings)
```
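As a sketch of how the returned embeddings can be compared for semantic search, here is a cosine-similarity helper. The 4-dimensional vectors below are toy stand-ins for the 768-dimensional arrays that `model.encode` actually returns:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; in practice these come from model.encode(...)
emb_a = np.array([0.1, 0.3, 0.2, 0.4])
emb_b = np.array([0.1, 0.3, 0.2, 0.4])    # identical to emb_a
emb_c = np.array([-0.4, 0.2, -0.3, 0.1])  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # 1.0: identical vectors
print(cosine_similarity(emb_a, emb_c))  # 0.0: orthogonal vectors
```

For semantic search you would rank candidate sentences by their similarity to a query embedding and return the highest-scoring ones.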
## Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: pass the input through the transformer model, then apply mean pooling on top of the contextualized token embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, ignoring padding via the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
model = AutoModel.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
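To make the pooling step concrete, the following NumPy-only sketch reproduces what `mean_pooling` computes on a toy batch (the real model uses 768-dimensional token embeddings): positions where the attention mask is 0 (padding) are excluded from the average.

```python
import numpy as np

# Toy batch: 2 sentences, 3 token positions, 4-dim token embeddings
token_embeddings = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]],
    [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0], [9.0, 9.0, 9.0, 9.0]],
])
# The second sentence has one padding token (mask 0) that must be ignored
attention_mask = np.array([[1, 1, 1],
                           [1, 1, 0]])

mask = attention_mask[:, :, None].astype(float)  # broadcast mask over the embedding dim
summed = (token_embeddings * mask).sum(axis=1)   # sum real tokens only
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # guard against division by zero
sentence_embeddings = summed / counts

print(sentence_embeddings)
# Row 0: mean of all 3 tokens -> [3. 4. 5. 6.]
# Row 1: mean of the first 2 tokens only -> [3. 3. 3. 3.]
```

The padding token `[9, 9, 9, 9]` does not pull the second sentence's embedding upward, which is exactly what the attention-mask weighting is for.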
## Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
## Application: Document Classification

Tutorial in Google Colab: https://colab.research.google.com/drive/1S6WSjOx9h6Wh_rX1Z2UXwx9i_uHLlOiM
| Model                        | Accuracy |
|------------------------------|----------|
| KR-SBERT-Medium-NLI-STS      | 0.8400   |
| KR-SBERT-V40K-NLI-STS        | 0.8400   |
| KR-SBERT-V40K-NLI-augSTS     | 0.8511   |
| KR-SBERT-V40K-klueNLI-augSTS | 0.8628   |
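As a minimal illustration (not the exact recipe from the Colab notebook), classification on top of the sentence embeddings can be as simple as nearest-neighbour search over labelled embeddings. The random vectors and labels below are hypothetical stand-ins for `model.encode` output:

```python
import numpy as np

# Hypothetical labelled embeddings; in practice: train_emb = model.encode(train_docs)
rng = np.random.default_rng(42)
train_emb = rng.normal(size=(6, 768))
train_labels = ["sports", "sports", "sports", "politics", "politics", "politics"]

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(query_emb):
    # Cosine similarity = dot product of L2-normalized vectors;
    # predict the label of the most similar training document
    sims = l2_normalize(train_emb) @ l2_normalize(query_emb)
    return train_labels[int(np.argmax(sims))]

# Sanity check: a training embedding is most similar to itself
print(classify(train_emb[0]))  # sports
print(classify(train_emb[4]))  # politics
```

A linear classifier trained on the embeddings is another common choice; the point is that the 768-dimensional sentence vectors serve directly as document features.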
## Citation

```bibtex
@misc{kr-sbert,
  author = {Park, Suzi and Hyopil Shin},
  title = {KR-SBERT: A Pre-trained Korean-specific Sentence-BERT model},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snunlp/KR-SBERT}}
}
```