BioSimCSE-BioLinkBERT-BASE: a free, open-source model for biomedical text similarity

Biosimcse BioLinkBERT BASE

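Developed by kamalkraj

A biomedical sentence-embedding model based on BioLinkBERT, designed for computing similarity between biomedical texts

Text embedding

Transformers

#BiomedicalTextEmbedding #ContrastiveLearning #ScientificLiteratureSimilarity

Downloads: 774

Release date: 12/5/2022

Model overview

This is a sentence-transformers model that maps biomedical sentences and paragraphs to a 768-dimensional dense vector space. It is suited to tasks such as clustering and semantic search.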

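Model features

Optimized for the biomedical domain

Trained specifically on biomedical text. The accompanying paper reports state-of-the-art results on biomedical Semantic Textual Similarity (STS) and Recognizing Question Entailment (RQE) tasks.

Contrastive learning

Trained with MultipleNegativesRankingLoss, a contrastive objective, to improve sentence-embedding quality

Efficient vector representation

Converts each sentence into a 768-dimensional dense vector that downstream tasks can consume directly

Model capabilities

Biomedical text similarity computation

Sentence embedding generation

Semantic search

Text clustering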
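
All of the capabilities listed above come down to comparing embedding vectors. As a minimal sketch of semantic search, assuming the embeddings have already been computed: the 3-dimensional toy vectors and the helper names `cosine_similarity` and `search` are illustrative assumptions, not part of the model or of any library (real embeddings from this model have 768 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for sentence embeddings (hypothetical values).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [1.0, 0.05, 0.0]

results = search(query, corpus, top_k=2)
for idx, score in results:
    print(idx, round(score, 4))
```

With real data, `corpus` would hold the vectors returned by the model for each document and `query` the vector for the search text.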

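Use cases

Biomedical research

Literature retrieval

Improve biomedical literature retrieval systems by ranking documents on semantic similarity

Can improve the accuracy of relevant-literature retrieval

Comparing research findings

Automatically identify similar or related findings across different studies

Can speed up the literature review process

Clinical decision support

Case similarity analysis

Match similar cases by comparing embeddings of symptom descriptions

Can assist clinical decision-making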

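🚀 kamalkraj/BioSimCSE-BioLinkBERT-BASE

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

🚀 Quick start

✨ Key features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Suited to tasks such as clustering and semantic search.

📦 Installation

To use this model, first install sentence-transformers: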

pip install -U sentence-transformers

💻 Usage examples

Basic usage

The following example calls the model through the sentence-transformers library:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
embeddings = model.encode(sentences)
print(embeddings)

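Advanced usage

Without sentence-transformers, you can use the model in two steps: first pass the input through the Transformer model, then apply the appropriate pooling operation (mean pooling for this model) to the contextualized token embeddings.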

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')
model = AutoModel.from_pretrained('kamalkraj/BioSimCSE-BioLinkBERT-BASE')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
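
To make the role of the attention mask concrete, here is a small pure-Python sketch of the same masked averaging on hand-written numbers. The helper name `masked_mean` and the toy values are illustrative assumptions. Padding positions have mask 0, so they do not affect the average even when their vectors are non-zero.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over positions where the mask is 1.

    token_embeddings: list of token vectors for ONE sentence
    attention_mask:   list of 0/1 flags, same length as token_embeddings
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for d in range(dim):
            summed[d] += vec[d] * keep
    # Mirror the clamp in the torch version to avoid division by zero.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Averaging over all three positions instead would pull the result toward the padding vector, which is why the mask matters.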

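📚 Documentation

Evaluation results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

Data loader: torch.utils.data.dataloader.DataLoader of length 7708 with parameters: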

{'batch_size': 128, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss function: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
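
As a rough illustration of what these two parameters do: in a batch of (anchor, positive) pairs, the loss scores every anchor against every positive with cosine similarity, multiplies the scores by `scale`, and applies cross-entropy so that each anchor prefers its own positive over the other positives in the batch. The pure-Python sketch below follows that description. The function names and toy vectors are assumptions for illustration, and this is not the library's implementation.

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the scaled similarities.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]   # positives[i] is close to anchors[i]
swapped = [aligned[1], aligned[0]]   # positives deliberately mismatched

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned < loss_swapped)  # True
```

A larger `scale` sharpens the softmax, so mismatched pairs are penalized more strongly.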

Parameters of the fit() method:

{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 5e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 771,
    "weight_decay": 0.01
}
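
For intuition about the `WarmupLinear` schedule with these settings, the sketch below assumes the common shape of a linear warmup from 0 to the peak learning rate followed by a linear decay to 0. It takes the total number of steps to be 7708 (one epoch over the data loader above), so the 771 warmup steps are about 10% of training. The function name `warmup_linear_lr` is hypothetical, and the library's exact behavior may differ in details.

```python
def warmup_linear_lr(step, peak_lr=5e-5, warmup_steps=771, total_steps=7708):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Start of training, end of warmup, and end of training.
for step in (0, 771, 7708):
    print(step, warmup_linear_lr(step))
```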

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

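📄 License

The documentation does not specify a license.

🔧 Technical details

The model maps sentences and paragraphs to a 768-dimensional dense vector space. It was trained for one epoch with a batch size of 128, MultipleNegativesRankingLoss, and the AdamW optimizer at a learning rate of 5e-05. Sentence embeddings are obtained by mean pooling over the contextualized token embeddings, with inputs limited to a maximum sequence length of 128. The model can be evaluated automatically with the Sentence Embeddings Benchmark.

📄 Citation and authors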

@inproceedings{kanakarajan-etal-2022-biosimcse,
    title = "{B}io{S}im{CSE}: {B}io{M}edical Sentence Embeddings using Contrastive learning",
    author = "Kanakarajan, Kamal raj  and
      Kundumani, Bhuvana  and
      Abraham, Abhijith  and
      Sankarasubbu, Malaikannan",
    booktitle = "Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.louhi-1.10",
    pages = "81--86",
    abstract = "Sentence embeddings in the form of fixed-size vectors that capture the information in the sentence as well as the context are critical components of Natural Language Processing systems. With transformer model based sentence encoders outperforming the other sentence embedding methods in the general domain, we explore the transformer based architectures to generate dense sentence embeddings in the biomedical domain. In this work, we present BioSimCSE, where we train sentence embeddings with domain specific transformer based models with biomedical texts. We assess our model{'}s performance with zero-shot and fine-tuned settings on Semantic Textual Similarity (STS) and Recognizing Question Entailment (RQE) tasks. Our BioSimCSE model using BioLinkBERT achieves state of the art (SOTA) performance on both tasks.",
}