pipeline_tag: sentence-similarity
language: fr
datasets:
- stsb_multi_mt
tags:
- text
- sentence-similarity
- sentence-embedding
- camembert-large
license: apache-2.0
model-index:
- name: sentence-camembert-large by Van Tuan DANG
  results:
  - task:
      name: Sentence-Embedding
      type: Text Similarity
    dataset:
      name: Text Similarity fr
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Test Pearson correlation coefficient
      type: Pearson_correlation_coefficient
      value: xx.xx
library_name: sentence-transformers
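Description:
Sentence-CamemBERT-Large is a French sentence-embedding model developed by La Javaness. It represents the content and semantics of a French sentence as a mathematical vector, capturing the meaning of the text beyond individual words, which makes it well suited to semantic search.
Pre-trained sentence embedding models are the current state of the art for sentence embeddings in French.
The model is fine-tuned from the pre-trained facebook/camembert-large using Siamese BERT-Networks (with the 'sentence-transformers' library) on the stsb dataset.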
Usage
The model can be used directly (without a language model) as follows:
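from sentence_transformers import SentenceTransformer

# Load the fine-tuned French sentence-embedding model from the Hugging Face Hub.
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

# French input sentences (English gloss in the trailing comments).
sentences = [
    "Un avion est en train de décoller.",  # An airplane is taking off.
    "Un homme joue de la flûte.",  # A man is playing the flute.
    "Un homme étale du fromage râpé sur une pizza.",  # A man spreads grated cheese on a pizza.
    "Une personne jette un chat au plafond.",  # A person throws a cat at the ceiling.
    "Une personne est en train de plier un morceau de papier.",  # A person is folding a piece of paper.
]

# Returns one embedding vector per input sentence.
embeddings = model.encode(sentences)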
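Since the description mentions semantic search, here is a minimal sketch of how sentence vectors such as those returned by `model.encode` could be compared and ranked with cosine similarity. It is written in pure Python and uses toy 3-dimensional vectors so that it runs without downloading the model. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and not part of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query: Sequence[float], corpus: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumption for illustration).
toy_corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]

ranking = rank_by_similarity(toy_query, toy_corpus)
for index, score in ranking:
    print(f"corpus[{index}] score={score:.3f}")
```

With real embeddings, the `sentence_transformers.util.cos_sim` function computes the same quantity on whole batches and is likely the more convenient choice.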
Evaluation
The model can be evaluated on the French dev and test data of stsb as follows:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from datasets import load_dataset

# Load the model to evaluate.
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

def convert_dataset(dataset):
    """Convert STS rows into InputExample pairs with labels scaled to [0, 1]."""
    dataset_samples = []
    for df in dataset:
        # Gold similarity scores range from 0 to 5.
        score = float(df['similarity_score']) / 5.0
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# French dev and test splits of stsb_multi_mt.
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Evaluate on the dev split.
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# Evaluate on the test split.
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
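The evaluator used here compares embedding similarities with the gold labels through Pearson and Spearman correlation. The following pure-Python sketch shows how these two statistics can be computed. The `predicted` and `gold` values are hypothetical numbers chosen for illustration, not results from this model, and the helper names are not library functions.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine similarities and gold scores (already divided by 5).
predicted = [0.10, 0.40, 0.35, 0.80]
gold = [0.0, 0.5, 0.3, 1.0]

print(f"Pearson:  {pearson(predicted, gold):.4f}")
print(f"Spearman: {spearman(predicted, gold):.4f}")
```

Pearson rewards a linear relationship between predicted and gold scores, while Spearman only checks that the two orderings agree, which is why the two figures can differ on the same predictions.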
Test Result:
Performance is measured using Pearson and Spearman correlation:
Citation
@article{reimers2019sentence,
title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
author={Reimers, Nils and Gurevych, Iryna},
journal={https://arxiv.org/abs/1908.10084},
year={2019}
}
@article{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}