pipeline_tag: sentence-similarity
language: fr
datasets:
- stsb_multi_mt
tags:
- Text
- Sentence Similarity
- Sentence-Embedding
- camembert-base
license: apache-2.0
model-index:
- name: sentence-flaubert-base by Van Tuan DANG
  results:
  - task:
      name: Sentence-Embedding
      type: Text Similarity
    dataset:
      name: French text similarity
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Test Pearson correlation coefficient
      type: Pearson correlation coefficient
      value: 87.14
library_name: sentence-transformers
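Pre-trained sentence embedding models are the state of the art for sentence embeddings in French.
This model is fine-tuned from the pre-trained flaubert/flaubert_base_uncased checkpoint as a Siamese BERT network with the sentence-transformers library. Training on the stsb dataset is combined with the Augmented SBERT method, using a pair sampling strategy that relies on two models: CrossEncoder-camembert-large and dangvantuan/sentence-camembert-large.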
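The Augmented SBERT idea mentioned above can be illustrated with a small sketch: candidate sentence pairs are sampled with one scorer, then labeled with a second, stronger scorer to produce extra "silver" training pairs. Everything below is a hypothetical stand-in written in plain Python. `token_overlap` replaces both the bi-encoder used for sampling and the cross-encoder used for labeling, and `build_silver_pairs` is an illustrative name, not part of any library. The exact procedure used to train this model is not described here.

```python
def token_overlap(a, b):
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens.

    In a real setup, sampling would use a bi-encoder and labeling a cross-encoder.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_silver_pairs(sentences, retrieve_score, label_score, top_k=1):
    """Sample the top_k most similar partners per sentence, then label each pair."""
    silver = {}
    for i, s in enumerate(sentences):
        # Sampling step: rank every other sentence by the retrieval scorer.
        candidates = [(retrieve_score(s, t), j)
                      for j, t in enumerate(sentences) if j != i]
        candidates.sort(reverse=True)
        for _, j in candidates[:top_k]:
            key = (min(i, j), max(i, j))  # store each unordered pair once
            # Labeling step: score the kept pair with the (assumed stronger) labeler.
            silver[key] = label_score(sentences[key[0]], sentences[key[1]])
    return [(sentences[i], sentences[j], score)
            for (i, j), score in sorted(silver.items())]


sentences = [
    "un homme joue de la flûte",
    "un homme joue de la guitare",
    "un avion décolle",
]
silver_pairs = build_silver_pairs(sentences, token_overlap, token_overlap)
for s1, s2, score in silver_pairs:
    print(f"{score:.3f}  {s1} | {s2}")
```

The resulting silver pairs would then be merged with the human-labeled (gold) pairs before training the bi-encoder.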
Usage
The model can be used directly, without a separate language model, as follows:
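from sentence_transformers import SentenceTransformer

# Load the fine-tuned bi-encoder from the Hugging Face Hub.
model = SentenceTransformer("Lajavaness/sentence-flaubert-base")

# French input sentences (the model targets French text).
sentences = ["Un avion est en train de décoller.",
             "Un homme joue de la flûte.",
             "Un homme étale du fromage râpé sur une pizza.",
             "Une personne jette un chat au plafond.",
             "Une personne est en train de plier un morceau de papier."]

# encode() returns one embedding vector per input sentence.
embeddings = model.encode(sentences)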
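Sentence embeddings are typically compared with cosine similarity. The sketch below shows the computation in plain Python. The three short vectors and their names are made up for illustration and stand in for real embeddings, which would be much longer and would come from `model.encode`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (assumed values).
toy_embeddings = {
    "plane": [0.9, 0.1, 0.0],
    "flute": [0.1, 0.8, 0.3],
    "paper": [0.2, 0.7, 0.4],
}

# Rank all entries by similarity to the "flute" vector, most similar first.
query = toy_embeddings["flute"]
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
```

With real embeddings, the sentence-transformers helper `util.cos_sim` computes the same quantity for whole batches.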
Evaluation
The model can be evaluated on the French test data of stsb as follows:
from sentence_transformers import SentenceTransformer
from sentence_transformers.readers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

# Load the model to evaluate.
model = SentenceTransformer("Lajavaness/sentence-flaubert-base")

def convert_dataset(dataset):
    """Turn stsb rows into InputExample objects with labels scaled to [0, 1]."""
    dataset_samples = []
    for df in dataset:
        # Gold similarity scores range from 0 to 5.
        score = float(df['similarity_score']) / 5.0
        inp_example = InputExample(texts=[df['sentence1'],
                                          df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Load the French dev and test splits of stsb_multi_mt.
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Evaluate on the dev split.
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# Evaluate on the test split.
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
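Test results:
Performance is measured with the Pearson and Spearman correlation coefficients on sts-benchmark:
- On the test set: Pearson and Spearman correlations are evaluated on several benchmark datasets.
Pearson score
Spearman score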
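For readers unfamiliar with the two correlation measures named in this section, here is a minimal sketch of how they can be computed in plain Python. The function names and the toy score lists are assumptions for illustration. This is not the implementation used by `EmbeddingSimilarityEvaluator`, which relies on a scientific library.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: gold labels in [0, 1] and made-up predicted cosine similarities.
gold = [0.0, 0.2, 0.5, 1.0]
predicted = [0.1, 0.3, 0.4, 0.9]
print(pearson(gold, predicted), spearman(gold, predicted))
```

Pearson measures linear agreement between predicted and gold scores, while Spearman only checks that both lists order the sentence pairs the same way.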
Citation
@article{reimers2019sentence,
title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
author={Reimers, Nils and Gurevych, Iryna},
journal={https://arxiv.org/abs/1908.10084},
year={2019}
}
@article{martin2020camembert,
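title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and others},
journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},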
year={2020}
}
@article{thakur2020augmented,
title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
author={Thakur, Nandan and others},
journal={arXiv preprint},
pages={arXiv--2010},
year={2020}
}