🚀 SentenceTransformer
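This is a trained sentence-transformers model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.
🚀 Quick Start
Direct Usage (Sentence Transformers)
First, install the Sentence Transformers library: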
pip install -U sentence-transformers
Then load the model and run inference:
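from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("GreenNode/GreenNode-Embedding-Large-VN-Mixed-V1")

# Sentences to embed
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Encode the sentences into dense vectors (one 1024-dimensional row per sentence)
embeddings = model.encode(sentences)
print(embeddings.shape)

# Pairwise similarity scores between all sentences (a 3 x 3 matrix)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)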
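Pairwise similarity scores like the ones computed above are the building block for semantic search: embed a query, score it against every document embedding, and keep the best matches. The following is a minimal sketch of that ranking step, not part of the Sentence Transformers API. It assumes embeddings are plain lists of floats, uses toy 3-dimensional vectors in place of the model's 1024-dimensional output, and the helper names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption: stand-ins for real model embeddings)
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # close to the query
    [-1.0, 0.0, 0.0],  # opposite direction
]

ranking = rank_documents(query, docs, top_k=2)
print(ranking)  # document 1 ranks first, document 0 second
```

In practice, `query` and `docs` would come from `model.encode(...)`, and `model.similarity` can replace the hand-written scoring.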
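✨ Key Features
- Versatile: suitable for a range of NLP tasks, including semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
- High-dimensional mapping: maps sentences and paragraphs to a 1024-dimensional dense vector space.
- Multilingual support: supports multiple languages, including Vietnamese.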
📚 Documentation
Model Details
Model Description
Model Sources
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
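The architecture above chains three modules: a Transformer encoder, a pooling layer with `pooling_mode_cls_token` enabled, and a normalization layer. As a rough illustration of what the last two stages do conceptually, the sketch below takes the first token's vector as the sentence vector and rescales it to unit L2 norm. It is a simplified pure-Python approximation, not the library's implementation; the helper names `cls_pool` and `l2_normalize` are hypothetical, and the toy 2-dimensional token vectors stand in for real 1024-dimensional ones.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence vector."""
    return list(token_embeddings[0])


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Toy token embeddings: 3 tokens x 2 dimensions (assumption: illustrative only)
token_embeddings = [
    [3.0, 4.0],  # position 0: the token used for CLS pooling
    [1.0, 1.0],
    [0.5, 2.0],
]

sentence_vec = l2_normalize(cls_pool(token_embeddings))
print(sentence_vec)
```

Because the output vectors have unit length, the dot product of two sentence vectors equals their cosine similarity.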
Evaluation
Table: Performance comparison of models on GreenNodeTableRetrieval
Dataset: GreenNode/GreenNode-Table-Markdown-Retrieval

| Model Name | MAP@5 ↑ | MRR@5 ↑ | NDCG@5 ↑ | Recall@5 ↑ | Mean ↑ |
|---|---|---|---|---|---|
| Multilingual embedding models | | | | | |
| me5_small | 33.75 | 33.75 | 35.68 | 41.49 | 36.17 |
| me5_large | 38.16 | 38.16 | 40.27 | 46.62 | 40.80 |
| M3-Embedding | 36.52 | 36.52 | 38.60 | 44.84 | 39.12 |
| OpenAI-embedding-v3 | 30.61 | 30.61 | 32.57 | 38.46 | 33.06 |
| Vietnamese embedding models (prior work) | | | | | |
| halong-embedding | 32.15 | 32.15 | 34.13 | 40.09 | 34.63 |
| sup-SimCSE-VietNamese-phobert_base | 10.90 | 10.90 | 12.03 | 15.41 | 12.31 |
| vietnamese-bi-encoder | 13.61 | 13.61 | 14.63 | 17.68 | 14.89 |
| GreenNode-Embedding (our work) | | | | | |
| M3-GN-VN | 41.85 | 41.85 | 44.15 | 57.05 | 46.23 |
| M3-GN-VN-Mixed | 42.08 | 42.08 | 44.33 | 51.06 | 44.89 |
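For readers unfamiliar with the column headers, the sketch below shows one common way to compute the four ranking metrics for a single query at a cutoff of 5. The exact evaluation protocol behind the tables is not described here, so this is an assumption: it uses binary relevance, unique document ids, a non-empty relevant set, and average precision normalized by `min(k, number of relevant documents)`. All function names and document ids are hypothetical. Dataset-level scores would be the average of these per-query values, and the Mean column appears to be the arithmetic mean of the four metrics.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=5):
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=5):
    """Mean of precision values taken at each relevant hit in the top k."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant_ids))


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """Discounted cumulative gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Hypothetical query: the single relevant document is retrieved at rank 2
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2"}

scores = {
    "AP@5": average_precision_at_k(ranked, relevant),
    "RR@5": reciprocal_rank_at_k(ranked, relevant),
    "NDCG@5": ndcg_at_k(ranked, relevant),
    "Recall@5": recall_at_k(ranked, relevant),
}
scores["Mean"] = sum(scores.values()) / 4
print(scores)
```

With exactly one relevant document per query, average precision and reciprocal rank coincide under these definitions.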
Table: Performance comparison of models on ZacLegalTextRetrieval
Dataset: GreenNode/zalo-ai-legal-text-retrieval-vn

| Model Name | MAP@5 ↑ | MRR@5 ↑ | NDCG@5 ↑ | Recall@5 ↑ | Mean ↑ |
|---|---|---|---|---|---|
| Multilingual embedding models | | | | | |
| me5_small | 54.68 | 54.37 | 58.32 | 69.16 | 59.13 |
| me5_large | 60.14 | 59.62 | 64.17 | 76.02 | 64.99 |
| M3-Embedding | 69.34 | 68.96 | 73.70 | 86.68 | 74.67 |
| OpenAI-embedding-v3 | 38.68 | 38.80 | 41.53 | 49.94 | 41.74 |
| Vietnamese embedding models (prior work) | | | | | |
| halong-embedding | 52.57 | 52.28 | 56.64 | 68.72 | 57.55 |
| sup-SimCSE-VietNamese-phobert_base | 25.15 | 25.07 | 27.81 | 35.79 | 28.46 |
| vietnamese-bi-encoder | 54.88 | 54.47 | 59.10 | 79.51 | 61.99 |
| GreenNode-Embedding (our work) | | | | | |
| M3-GN-VN | 65.03 | 64.80 | 69.19 | 81.66 | 70.17 |
| M3-GN-VN-Mixed | 69.75 | 69.28 | 74.01 | 86.74 | 74.95 |
Table: Performance comparison of models on VieQuADRetrieval
Dataset: taidng/UIT-ViQuAD2.0

| Model Name | MAP@5 ↑ | MRR@5 ↑ | NDCG@5 ↑ | Recall@5 ↑ | Mean ↑ |
|---|---|---|---|---|---|
| Multilingual embedding models | | | | | |
| me5_small | 40.42 | 69.21 | 50.05 | 50.71 | 52.60 |
| me5_large | 44.18 | 67.81 | 53.04 | 55.86 | 55.22 |
| M3-Embedding | 44.08 | 72.28 | 54.07 | 56.01 | 56.61 |
| OpenAI-embedding-v3 | 32.39 | 53.97 | 40.48 | 43.02 | 42.47 |
| Vietnamese embedding models (prior work) | | | | | |
| halong-embedding | 39.42 | 62.31 | 48.63 | 52.73 | 50.77 |
| sup-SimCSE-VietNamese-phobert_base | 20.45 | 35.99 | 26.73 | 29.59 | 28.19 |
| vietnamese-bi-encoder | 31.89 | 54.62 | 40.26 | 42.53 | 42.33 |
| GreenNode-Embedding (our work) | | | | | |
| M3-GN-VN | 42.85 | 71.98 | 52.90 | 54.25 | 55.50 |
| M3-GN-VN-Mixed | 44.20 | 72.64 | 54.30 | 56.30 | 56.86 |
Table: Hit rate comparison of models on GreenNodeTableRetrieval

| Model Name | Hit Rate@1 ↑ | Hit Rate@5 ↑ | Hit Rate@10 ↑ | Hit Rate@20 ↑ |
|---|---|---|---|---|
| Multilingual embedding models | | | | |
| me5_small | 38.99 | 53.37 | 59.28 | 65.09 |
| me5_large | 43.99 | 59.74 | 65.74 | 71.59 |
| bge-m3 | 42.15 | 57.00 | 63.05 | 68.96 |
| OpenAI-embedding-v3 | - | - | - | - |
| Vietnamese embedding models (prior work) | | | | |
| halong-embedding | 37.22 | 52.49 | 58.57 | 64.64 |
| sup-SimCSE-VietNamese-phobert_base | 14.00 | 24.74 | 30.32 | 36.44 |
| vietnamese-bi-encoder | 16.89 | 25.94 | 30.50 | 35.70 |
| GreenNode-Embedding (our work) | | | | |
| M3-GN-VN | 48.31 | 64.60 | 70.83 | 76.46 |
| M3-GN-VN-Mixed | 47.94 | 64.24 | 70.43 | 76.14 |
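Hit rate at k is commonly defined as the fraction of queries for which at least one relevant document appears in the top k results. The sketch below follows that definition; whether the table above was produced with exactly this definition is an assumption, and the function name `hit_rate_at_k` and the toy document ids are hypothetical.

```python
def hit_rate_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant document in the top k.

    rankings: one ranked list of document ids per query
    relevant: one set of relevant document ids per query (same order)
    """
    hits = sum(
        1
        for ranked_ids, relevant_ids in zip(rankings, relevant)
        if any(doc_id in relevant_ids for doc_id in ranked_ids[:k])
    )
    return hits / len(rankings)


# Four hypothetical queries with three retrieved documents each
rankings = [
    ["a", "b", "c"],  # relevant document at rank 1
    ["d", "e", "f"],  # relevant document at rank 3
    ["g", "h", "i"],  # relevant document not retrieved
    ["j", "k", "l"],  # relevant document at rank 2
]
relevant = [{"a"}, {"f"}, {"z"}, {"k"}]

for k in (1, 2, 3):
    print(f"Hit Rate@{k}: {hit_rate_at_k(rankings, relevant, k):.2f}")
```

By construction the value can only stay the same or grow as k increases, which matches the left-to-right trend in each row of the table.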
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.1
- Accelerate: 0.33.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
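To compare a local environment against the versions listed above, a small check with the standard library's `importlib.metadata` may help. This is a convenience sketch rather than a requirement of the model: the package names are the usual PyPI distribution names (for example `torch` for PyTorch), the helper names are hypothetical, and a different version does not necessarily mean the model will fail to load.

```python
from importlib import metadata

# Versions listed in this README (the Python version itself is not checked here)
EXPECTED = {
    "sentence-transformers": "3.0.1",
    "transformers": "4.42.4",
    "torch": "2.3.1",
    "accelerate": "0.33.0",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected):
    """Map each package name to an (expected, installed) pair."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}


for name, (want, have) in compare_versions(EXPECTED).items():
    status = "ok" if have == want else "differs"
    print(f"{name}: expected {want}, installed {have} ({status})")
```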
🔗 Follow Us
https://x.com/greennode23
🛠️ Support
https://discord.gg/B6MJFM3J3a
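📄 License
This repository and the model weights are released under the MIT License.
📧 Contact Us
- General inquiries and partnerships: tung.vu@greennode.ai, thuvt@greennode.ai
- Technical questions: viethq5@greennode.ai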