---
language:
- ru
pipeline_tag: sentence-similarity
tags:
- russian
- pretraining
- embeddings
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
datasets:
- IlyaGusev/gazeta
- zloelias/lenta-ru
license: mit
base_model: cointegrated/LaBSE-en-ru
---
A BERT model for computing Russian sentence embeddings. It is derived from cointegrated/LaBSE-en-ru and keeps the same context length (512), embedding dimension (768), and inference speed.

## Usage
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
```
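Note that `util.dot_score` equals cosine similarity only when the embeddings are L2-normalized (LaBSE-family models typically normalize their output; `model.encode(..., normalize_embeddings=True)` enforces it explicitly). A minimal NumPy sketch with toy 2-D vectors standing in for real 768-dimensional model outputs illustrates the equivalence:

```python
import numpy as np

# Toy "embeddings" standing in for model.encode() output (real dim is 768).
emb = np.array([[3.0, 4.0], [4.0, 3.0], [-4.0, 3.0]])

# L2-normalize each row, as normalize_embeddings=True would.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# For unit vectors, the dot-product matrix IS the cosine-similarity matrix.
dot = emb @ emb.T
norms = np.linalg.norm(emb, axis=1)            # all ones after normalization
cos = dot / np.outer(norms, norms)             # identical to dot

print(np.round(dot, 3))  # diagonal is 1.0; dot[0, 1] == 0.96
```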
## Metrics

Performance on the encodechka benchmark:
| Model | CPU, ms | GPU, ms | Size, MB | Mean semantic | Mean semantic + word order | Dim |
|:---|---:|---:|---:|---:|---:|---:|
| sergeyzh/LaBSE-ru-turbo | 120.40 | 8.05 | 490 | 0.789 | 0.702 | 768 |
| BAAI/bge-m3 | 523.40 | 22.50 | 2166 | 0.787 | 0.696 | 1024 |
| intfloat/multilingual-e5-large | 506.80 | 30.80 | 2136 | 0.780 | 0.686 | 1024 |
| intfloat/multilingual-e5-base | 130.61 | 14.39 | 1061 | 0.761 | 0.669 | 768 |
| sergeyzh/rubert-tiny-turbo | 5.51 | 3.25 | 111 | 0.749 | 0.667 | 312 |
| intfloat/multilingual-e5-small | 40.86 | 12.09 | 449 | 0.742 | 0.645 | 384 |
| cointegrated/LaBSE-en-ru | 120.40 | 8.05 | 490 | 0.739 | 0.667 | 768 |
| Model | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| sergeyzh/LaBSE-ru-turbo | 0.864 | 0.748 | 0.490 | 0.814 | 0.974 | 0.806 | 0.815 | 0.801 | 0.305 | 0.404 |
| BAAI/bge-m3 | 0.864 | 0.749 | 0.510 | 0.819 | 0.973 | 0.792 | 0.809 | 0.783 | 0.240 | 0.422 |
| intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 | 0.798 | 0.819 | 0.773 | 0.224 | 0.374 |
| intfloat/multilingual-e5-base | 0.835 | 0.704 | 0.459 | 0.796 | 0.964 | 0.783 | 0.802 | 0.738 | 0.235 | 0.376 |
| sergeyzh/rubert-tiny-turbo | 0.828 | 0.722 | 0.476 | 0.787 | 0.955 | 0.757 | 0.780 | 0.685 | 0.305 | 0.373 |
| intfloat/multilingual-e5-small | 0.822 | 0.714 | 0.457 | 0.758 | 0.957 | 0.761 | 0.779 | 0.691 | 0.234 | 0.275 |
| cointegrated/LaBSE-en-ru | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 | 0.766 | 0.789 | 0.769 | 0.340 | 0.414 |
Performance on the ruMTEB benchmark:
| Task | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:---|:---|---:|---:|---:|---:|---:|---:|---:|
| CEDR classification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | 0.448 |
| Geo-review classification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | 0.497 |
| Geo-review clustering | V-measure | 0.584 | 0.590 | 0.626 | 0.644 | 0.586 | 0.545 | 0.605 |
| News headline classification | Accuracy | 0.772 | 0.793 | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
| Inappropriateness classification | Accuracy | 0.646 | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
| Movie review classification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | 0.566 |
| News retrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | 0.807 |
| QA reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
| QA retrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | 0.741 |
| Russian review classification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | 0.653 |
| Russian STS benchmark | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
| Russian scientific-paper classification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | 0.582 |
| Russian scientific-paper clustering | V-measure | 0.522 | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
| OECD classification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | 0.445 |
| OECD clustering | V-measure | 0.473 | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
| Sensitive-topics classification | Accuracy | 0.285 | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
| TERRA classification | Average precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
Averages by task type:

| Task type | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:---|:---|---:|---:|---:|---:|---:|---:|---:|
| Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | 0.588 |
| Clustering | V-measure | 0.526 | 0.519 | 0.513 | 0.538 | 0.513 | 0.503 | 0.525 |
| Multi-label classification | Accuracy | 0.326 | 0.319 | 0.340 | 0.361 | 0.314 | 0.329 | 0.353 |
| Pair classification | Average precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
| Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
| Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | 0.774 |
| STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |