LaBSE-ru-sts open-source model: accurate Russian sentence embeddings for semantic textual similarity tasks


Labse Ru Sts

Developed by sergeyzh

A high-quality BERT model for computing Russian sentence embeddings, derived from cointegrated/LaBSE-en-ru and tuned for semantic textual similarity tasks.

Text Embeddings

Transformers

License: MIT #Russian semantic similarity #multi-task optimization #efficient GPU inference

Downloads: 4,650

Released: 3/24/2024

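Model Overview

This model is built specifically for Russian semantic textual similarity. It produces high-quality sentence embedding vectors that can be used across a range of natural language processing tasks.

Model Features

High-quality Russian embeddings

Sentence embeddings optimized specifically for Russian, scoring 0.845 on the STS task of the encodechka benchmark.

Efficient computation

Faster inference than larger models such as intfloat/multilingual-e5-large, while keeping comparable quality.

768-dimensional embedding space

Provides a sufficiently rich space for semantic representation.

512-token context length

Supports processing of longer text passages.

Model Capabilities

Semantic textual similarity computation

Sentence embedding generation

Text feature extraction

Paraphrase identification

Natural language inference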

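Use Cases

Information retrieval

Document similarity search

Used to build semantic document retrieval systems.

Reaches an NDCG@10 of 0.651 on news retrieval (RiaNewsRetrieval in ruMTEB).

Text classification

Sentiment analysis

Used for sentiment classification of Russian reviews.

Reaches an accuracy of 0.599 (RuReviewsClassification in ruMTEB).

Question answering

QA reranking

Improves the ranking quality of answers in question-answering systems.

Reaches a MAP@10 of 0.688 (RuBQReranking in ruMTEB).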
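
The retrieval score above is reported as NDCG@10, a ranking metric that rewards placing relevant documents near the top of the result list. As a rough illustration only, the sketch below computes NDCG@k for a single query from hand-made relevance labels. The function names and the labels are hypothetical, and the ideal ranking is derived from the retrieved list alone, which is a simplification; the benchmark's own implementation may differ in details.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels of retrieved documents, in ranked order.
perfect = [1, 1, 0, 0]     # both relevant documents ranked first
imperfect = [0, 1, 1, 0]   # an irrelevant document ranked first

print(ndcg_at_k(perfect), ndcg_at_k(imperfect))
```

A perfect ordering scores 1.0, and any ordering that pushes relevant documents down the list scores lower.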

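🚀 Base Bert model for semantic text similarity (STS) on GPU

This is a high-quality BERT model for computing embeddings of Russian sentences. It is based on cointegrated/LaBSE-en-ru and has a similar context length (512), embedding dimension (768), and inference speed.

🚀 Quick Start

✨ Key Features

Designed for computing Russian sentence embeddings, suited to semantic textual similarity (STS) tasks.
Built on the established cointegrated/LaBSE-en-ru model, with the same size, embedding dimension, and context length, and near-identical inference speed.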

📦 Installation

Install the required libraries before using the model (the examples below also need PyTorch):

pip install transformers sentencepiece

💻 Usage Examples

Basic usage

Call the model with the transformers library:

# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment it if you have a GPU

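def embed_bert_cls(text, model, tokenizer):
    # Tokenize the input, truncating to the model's maximum length
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        # Move the inputs to the model's device and run a forward pass
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the [CLS] token's hidden state as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()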

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
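
Because `embed_bert_cls` L2-normalizes its output, the dot product of two embeddings equals their cosine similarity. The sketch below shows that relationship with short hand-made vectors standing in for real 768-dimensional embeddings. The vectors and the helper names `l2_normalize` and `dot` are illustrative assumptions, not part of the model's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for sentence embeddings (real ones have 768 dimensions).
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit-length vectors the dot product is the cosine similarity.
similarity = dot(a, b)
print(round(similarity, 2))
```

With real embeddings, the same dot product can be taken directly on the NumPy arrays returned by `embed_bert_cls`.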

Advanced usage

Call the model with the sentence_transformers library:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
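
`util.dot_score` returns a square matrix whose entry (i, j) is the dot product of embeddings i and j. The sketch below shows one way such a matrix might be used to find each sentence's closest neighbour. The scores are made-up placeholders rather than real model output, and `nearest_neighbours` is a hypothetical helper.

```python
def nearest_neighbours(scores):
    """For each row, return the index of the highest-scoring other item."""
    result = []
    for i, row in enumerate(scores):
        # Skip the diagonal: every sentence trivially matches itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result

# Placeholder 3x3 score matrix (illustrative values, not model output).
scores = [
    [1.00, 0.80, 0.60],
    [0.80, 1.00, 0.55],
    [0.60, 0.55, 1.00],
]
neighbours = nearest_neighbours(scores)
print(neighbours)
```

A real result from `util.dot_score` is a tensor; calling `.tolist()` on it yields nested lists of the shape used here.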

📚 Documentation

Evaluation metrics

Results on the encodechka benchmark:

Model	STS	PI	NLI	SA	TI
intfloat/multilingual-e5-large	0.862	0.727	0.473	0.810	0.979
sergeyzh/LaBSE-ru-sts	0.845	0.737	0.481	0.805	0.957
sergeyzh/rubert-mini-sts	0.815	0.723	0.477	0.791	0.949
sergeyzh/rubert-tiny-sts	0.797	0.702	0.453	0.778	0.946
Tochka-AI/ruRoPEBert-e5-base-512	0.793	0.704	0.457	0.803	0.970
cointegrated/LaBSE-en-ru	0.794	0.659	0.431	0.761	0.946
cointegrated/rubert-tiny2	0.750	0.651	0.417	0.737	0.937

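Task abbreviations:

Semantic textual similarity (STS)
Paraphrase identification (PI)
Natural language inference (NLI)
Sentiment analysis (SA)
Toxicity identification (TI)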

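Speed and size

Speed and size measurements from the encodechka benchmark:

Model	CPU (ms per sentence)	GPU (ms per sentence)	Size (MB)	Dimension	Context length	Vocabulary size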
intfloat/multilingual-e5-large	149.026	15.629	2136	1024	514	250002
sergeyzh/LaBSE-ru-sts	42.835	8.561	490	768	512	55083
sergeyzh/rubert-mini-sts	6.417	5.517	123	312	2048	83828
sergeyzh/rubert-tiny-sts	3.208	3.379	111	312	2048	83828
Tochka-AI/ruRoPEBert-e5-base-512	43.314	9.338	532	768	512	69382
cointegrated/LaBSE-en-ru	42.867	8.549	490	768	512	55083
cointegrated/rubert-tiny2	3.212	3.384	111	312	2048	83828
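
To make the speed comparison easier to read, the timings can be expressed relative to LaBSE-ru-sts. The sketch below does this for three rows copied from the table. The choice of rows and the variable names are illustrative, and the ratios are simple arithmetic on the published figures, not new measurements.

```python
# (CPU, GPU) timings copied from the table above.
timings = {
    "intfloat/multilingual-e5-large": (149.026, 15.629),
    "sergeyzh/LaBSE-ru-sts": (42.835, 8.561),
    "sergeyzh/rubert-tiny-sts": (3.208, 3.379),
}

# Express every timing as a multiple of the LaBSE-ru-sts timing.
base_cpu, base_gpu = timings["sergeyzh/LaBSE-ru-sts"]
ratios = {
    name: (cpu / base_cpu, gpu / base_gpu)
    for name, (cpu, gpu) in timings.items()
}

for name, (cpu_ratio, gpu_ratio) in ratios.items():
    print(f"{name}: CPU x{cpu_ratio:.2f}, GPU x{gpu_ratio:.2f}")
```

A ratio above 1 means the model is slower than LaBSE-ru-sts on that device, and a ratio below 1 means it is faster.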

Results on the ruMTEB benchmark:

Task	Metric	sbert_large_mt_nlu_ru	sbert_large_nlu_ru	LaBSE-ru-sts	LaBSE-ru-turbo	multilingual-e5-small	multilingual-e5-base	multilingual-e5-large
CEDRClassification	Accuracy	0.368	0.358	0.418	0.451	0.401	0.423	0.448
GeoreviewClassification	Accuracy	0.397	0.400	0.406	0.438	0.447	0.461	0.497
GeoreviewClusteringP2P	V-measure	0.584	0.590	0.626	0.644	0.586	0.545	0.605
HeadlineClassification	Accuracy	0.772	0.793	0.633	0.688	0.732	0.757	0.758
InappropriatenessClassification	Accuracy	0.646	0.625	0.599	0.615	0.592	0.588	0.616
KinopoiskClassification	Accuracy	0.503	0.495	0.496	0.521	0.500	0.509	0.566
RiaNewsRetrieval	NDCG@10	0.214	0.111	0.651	0.694	0.700	0.702	0.807
RuBQReranking	MAP@10	0.561	0.468	0.688	0.687	0.715	0.720	0.756
RuBQRetrieval	NDCG@10	0.298	0.124	0.622	0.657	0.685	0.696	0.741
RuReviewsClassification	Accuracy	0.589	0.583	0.599	0.632	0.612	0.630	0.653
RuSTSBenchmarkSTS	Pearson correlation	0.712	0.588	0.788	0.822	0.781	0.796	0.831
RuSciBenchGRNTIClassification	Accuracy	0.542	0.539	0.529	0.569	0.550	0.563	0.582
RuSciBenchGRNTIClusteringP2P	V-measure	0.522	0.504	0.486	0.517	0.511	0.516	0.520
RuSciBenchOECDClassification	Accuracy	0.438	0.430	0.406	0.440	0.427	0.423	0.445
RuSciBenchOECDClusteringP2P	V-measure	0.473	0.464	0.426	0.452	0.443	0.448	0.450
SensitiveTopicsClassification	Accuracy	0.285	0.280	0.262	0.272	0.228	0.234	0.257
TERRaClassification	Average precision	0.520	0.502	0.587	0.585	0.551	0.550	0.584
Classification	Accuracy	0.554	0.552	0.524	0.558	0.551	0.561	0.588
Clustering	V-measure	0.526	0.519	0.513	0.538	0.513	0.503	0.525
MultiLabelClassification	Accuracy	0.326	0.319	0.340	0.361	0.314	0.329	0.353
PairClassification	Average precision	0.520	0.502	0.587	0.585	0.551	0.550	0.584
Reranking	MAP@10	0.561	0.468	0.688	0.687	0.715	0.720	0.756
Retrieval	NDCG@10	0.256	0.118	0.637	0.675	0.697	0.699	0.774
STS	Pearson correlation	0.712	0.588	0.788	0.822	0.781	0.796	0.831
Average	Mean	0.494	0.438	0.582	0.604	0.588	0.594	0.630
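
The Average row appears to be the unweighted mean of the seven category-level rows (Classification through STS). This is an observation from the numbers rather than something the source states. The sketch below checks it for the LaBSE-ru-sts column, using values copied from the table.

```python
# Category-level scores for LaBSE-ru-sts, copied from the table above.
category_scores = {
    "Classification": 0.524,
    "Clustering": 0.513,
    "MultiLabelClassification": 0.340,
    "PairClassification": 0.587,
    "Reranking": 0.688,
    "Retrieval": 0.637,
    "STS": 0.788,
}

# Unweighted mean over the seven task categories.
average = sum(category_scores.values()) / len(category_scores)
print(f"{average:.3f}")  # prints 0.582, matching the Average row
```

Because every category counts equally, a category with a single dataset (such as Reranking) weighs as much as Classification, which aggregates many datasets.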