license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- pubmed
- medical
- biomedical
- clinical
- modernbert
ModernPubMedBERT
This is a sentence-transformers model trained on the PubMed dataset. Using Matryoshka Representation Learning, it maps sentences and paragraphs to a dense vector space with multiple embedding dimensions (768, 512, 384, 256, 128). This design lets you pick an embedding size that fits your application's needs while retaining strong performance on semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and related tasks.
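The dimension flexibility above comes from the fact that a Matryoshka embedding's leading components form a usable smaller embedding on their own. A minimal NumPy sketch of the idea (the random arrays below are stand-ins for what `model.encode(...)` would produce; this is not library code):

```python
import numpy as np

# Toy full-size embeddings standing in for model output:
# 3 sentences, 768 dimensions (the model's full output size).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length --
    the usual way a Matryoshka embedding is shrunk to a smaller size."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Choose any of the trained sizes: 768, 512, 384, 256, 128.
small = truncate_and_normalize(full, 256)
print(small.shape)  # (3, 256)

# Cosine similarity on unit vectors is just a dot product.
similarities = small @ small.T
print(similarities.shape)  # (3, 3)
```

Smaller sizes trade a little accuracy for faster search and a smaller index, which is the point of training all prefixes jointly.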
Model Details
Model Description
- Model Type: Sentence Transformer
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: English
- License: apache-2.0
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
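The Pooling module above has `pooling_mode_cls_token: True` and all other modes off, meaning the sentence embedding is the hidden state of the first ([CLS]) token. A small NumPy sketch of what that reduction does (shapes are illustrative, matching the 768-dimension config above; this is not the library's implementation):

```python
import numpy as np

# Toy Transformer output: (batch, seq_len, hidden_dim) = (2, 10, 768).
token_embeddings = np.random.default_rng(1).normal(size=(2, 10, 768))

# CLS pooling: take the first token's hidden state as the sentence embedding.
sentence_embeddings = token_embeddings[:, 0, :]
print(sentence_embeddings.shape)  # (2, 768)
```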
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then load this model and run inference:
from sentence_transformers import SentenceTransformer

# Download from the Hugging Face Hub
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

# Run inference
sentences = [
    "The patient was diagnosed with type 2 diabetes mellitus",
    "The individual shows symptoms of hyperglycemia and insulin resistance",
    "Metastatic cancer requires aggressive treatment approaches",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores between all pairs of sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
- Loss: MatryoshkaLoss with these parameters:
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
384,
256,
128
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
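The configuration above wraps MultipleNegativesRankingLoss in MatryoshkaLoss: the base loss is computed on each truncated prefix of the embeddings (768, 512, 384, 256, 128) and the results are summed with per-dimension weights (all 1 here). A self-contained NumPy sketch of that computation (an illustrative re-implementation for intuition, not the sentence-transformers code; the scale of 20 is an assumed default):

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """MultipleNegativesRankingLoss: in-batch softmax cross-entropy where each
    anchor's true positive is the matching row and all other rows are negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (B, B) scaled cosine similarities
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # correct pairs on the diagonal

def matryoshka_loss(anchors, positives, dims=(768, 512, 384, 256, 128), weights=None):
    """MatryoshkaLoss: apply the base loss to each embedding prefix and sum
    with per-dimension weights (all 1 in this model's config)."""
    weights = weights or [1] * len(dims)
    return sum(w * mnr_loss(anchors[:, :d], positives[:, :d])
               for d, w in zip(dims, weights))

# Toy batch: positives are noisy copies of their anchors (stand-in for paraphrases).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 768))
positives = anchors + 0.1 * rng.normal(size=(4, 768))
loss = matryoshka_loss(anchors, positives)
print(loss)
```

Because every prefix contributes to the loss, the model is pushed to pack useful information into the leading dimensions, which is what makes truncated embeddings work at inference time.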
Framework Versions
- Python: 3.10.10
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu128
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}