pipeline_tag: sentence-similarity
language: fr
datasets:
- stsb_multi_mt
tags:
- Text
- Sentence Similarity
- Sentence-Embedding
- camembert-base
license: apache-2.0
model-index:
- name: sentence-camembert-base by Van Tuan DANG
  results:
  - task:
      name: Sentence-Embedding
      type: Text Similarity
    dataset:
      name: French text similarity
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Test Pearson correlation coefficient
      type: Pearson correlation coefficient
      value: 86.88
library_name: sentence-transformers
Pre-trained sentence embedding models are the state of the art for French sentence embeddings
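This model improves on dangvantuan/sentence-camembert-base. It is fine-tuned on the stsb dataset with the Augmented SBERT method, using pair sampling strategies driven by two models: CrossEncoder-camembert-large and dangvantuan/sentence-camembert-large.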
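The general Augmented SBERT idea (Thakur et al., cited below) can be sketched roughly as follows: one scorer proposes candidate sentence pairs, a second scorer assigns them "silver" similarity labels, and the labeled pairs are added to the training data of the final bi-encoder. This is only an illustration under stated assumptions, not the author's training code. The function `sample_and_label_pairs`, the top-k sampling rule, and the toy `jaccard` scorer are all hypothetical stand-ins; in practice the two scorers would be real models such as the bi-encoder and cross-encoder named above.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy similarity: word-set overlap. Stand-in for a real model score."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def sample_and_label_pairs(
    sentences: Sequence[str],
    retrieve_score: Callable[[str, str], float],  # hypothetical: bi-encoder similarity
    label_score: Callable[[str, str], float],     # hypothetical: cross-encoder score
    top_k: int = 1,
) -> List[Tuple[str, str, float]]:
    """Pick each sentence's top_k nearest neighbours, then label the pairs."""
    pairs = set()
    for i, sentence in enumerate(sentences):
        candidates = [
            (retrieve_score(sentence, other), j)
            for j, other in enumerate(sentences)
            if j != i
        ]
        candidates.sort(reverse=True)  # highest retrieval score first
        for _, j in candidates[:top_k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair only once
    return [
        (sentences[i], sentences[j], label_score(sentences[i], sentences[j]))
        for i, j in sorted(pairs)
    ]


# Toy run: both scorers are the same Jaccard stand-in here.
toy_sentences = ["a b c", "a b d", "x y z", "x y w"]
silver_pairs = sample_and_label_pairs(toy_sentences, jaccard, jaccard, top_k=1)
```

The resulting silver pairs would then be merged with the gold stsb pairs before fine-tuning; that training step is omitted here.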
Usage
The model can be used directly (without a language model) as follows:
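from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("Lajavaness/sentence-camembert-base")

# French example sentences (English glosses in the trailing comments)
sentences = [
    "Un avion est en train de décoller.",  # A plane is taking off.
    "Un homme joue de la flûte.",  # A man is playing the flute.
    "Un homme étale du fromage râpé sur une pizza.",  # A man spreads grated cheese on a pizza.
    "Une personne jette un chat au plafond.",  # Someone throws a cat at the ceiling.
    "Une personne est en train de plier un morceau de papier.",  # Someone is folding a piece of paper.
]

# Returns one embedding vector per sentence
embeddings = model.encode(sentences)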
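Sentence embeddings like these are typically compared with cosine similarity. The sketch below shows that computation in plain Python on small stand-in vectors, so it runs without downloading the model. The helper names `cosine_similarity` and `similarity_matrix` and the toy vectors are illustrative assumptions, not part of the model's API.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Stand-in 2-d vectors; real embeddings from model.encode(...) would go here.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_matrix = similarity_matrix(toy_embeddings)
```

With sentence-transformers installed, `sentence_transformers.util.cos_sim(embeddings, embeddings)` computes the same pairwise matrix directly on the encoded output.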
Evaluation
The model can be evaluated on the French stsb test data as follows:
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Load the model to evaluate
model = SentenceTransformer("Lajavaness/sentence-camembert-base")

def convert_dataset(dataset):
    """Convert STS rows into InputExample objects with scores scaled to [0, 1]."""
    dataset_samples = []
    for df in dataset:
        # Gold similarity scores range from 0 to 5
        score = float(df['similarity_score']) / 5.0
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Load the French dev and test splits of stsb_multi_mt
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Evaluate on the dev split
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# Evaluate on the test split
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
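The evaluator compares the similarity of each embedding pair with the gold score and reports Pearson and Spearman correlations. As a minimal sketch of what those two statistics measure, the plain-Python version below computes both on made-up numbers. The helper names and the `predicted` and `gold` values are illustrative assumptions, not results from this model.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up similarity predictions and gold scores (already scaled to [0, 1]).
predicted = [0.10, 0.40, 0.35, 0.80]
gold = [0.0, 0.5, 0.4, 1.0]
pearson_score = pearson(predicted, gold)
spearman_score = spearman(predicted, gold)
```

Pearson rewards a linear relationship between predicted and gold scores, while Spearman only checks that the pairs are ranked in the same order, which is why both are usually reported together.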
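Test results:
Performance is measured with Pearson and Spearman correlation on the sts-benchmark:
- On the test set: Pearson and Spearman correlation are evaluated on several different benchmark datasets:
Pearson score
Spearman score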
Citation
@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={https://arxiv.org/abs/1908.10084},
  year={2019}
}
@article{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv e-prints},
  pages={arXiv--2010},
  year={2020}
}