SapBERT: an open-source biomedical entity representation model that captures fine-grained semantic relations for medical research


Sapbert From PubMedBERT Fulltext Mean Token

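Developed by cambridgeltl

A biomedical entity representation model built on PubMedBERT, with self-alignment pretraining to better capture semantic relations

Text embedding #Biomedical entity linking #Synonymy modeling #UMLS ontology pretraining

Downloads: 244.39k

Release date: 3/2/2022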

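Model overview

SapBERT is a biomedical entity representation model based on the PubMedBERT architecture. It is optimized for fine-grained semantic relations in the biomedical domain and is particularly suited to tasks that require modeling synonymy, such as entity linking.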

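Model features

Self-alignment pretraining

A purpose-built metric learning framework uses the UMLS biomedical ontology collection to reshape the entity representation space

One-model-for-all solution

A single model handles medical entity linking (MEL), with no need for a complex pipeline system

Cross-lingual extension

A cross-lingual extension of SapBERT appears at ACL 2021; the original SapBERT paper appears at NAACL 2021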
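
As a rough illustration of what "self-alignment" works with, the sketch below builds synonym pairs from a toy ontology: names that share a concept are treated as positives that a metric learning objective would pull together. The concept IDs, the names, and the helper `synonym_pairs` are hypothetical and chosen only for illustration; they are not real UMLS entries, and this is not the authors' training code.

```python
from itertools import combinations

# Hypothetical concept IDs and names, for illustration only (not real UMLS data).
toy_ontology = {
    "CONCEPT_A": ["covid-19", "coronavirus disease 2019", "SARS-CoV-2 infection"],
    "CONCEPT_B": ["high fever", "hyperpyrexia"],
    "CONCEPT_C": ["headache"],  # a single name yields no pairs
}

def synonym_pairs(ontology):
    """Return (concept_id, name_a, name_b) for every pair of names under one concept."""
    pairs = []
    for concept_id, names in ontology.items():
        for name_a, name_b in combinations(names, 2):
            pairs.append((concept_id, name_a, name_b))
    return pairs

pairs = synonym_pairs(toy_ontology)
for concept_id, name_a, name_b in pairs:
    print(concept_id, "|", name_a, "<->", name_b)
```

Names from different concepts would act as negatives in such a setup; how pairs are mined and weighted during the actual pretraining is described in the paper cited at the end of this card.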

Model capabilities

Biomedical entity representation

Semantic relation modeling

Entity linking

Synonym identification

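Use cases

Medical information processing

Medical entity linking

Links medical terms from different sources to standard concepts in the Unified Medical Language System (UMLS)

Sets a new state of the art on six MEL benchmark datasets

Scientific literature analysis

Analyzes relations between biomedical terms in scientific literature

Reaches state-of-the-art results in the scientific domain even without task-specific supervision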

🚀 SapBERT-PubMedBERT

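SapBERT is a pretrained model for biomedical entity representation. Its self-alignment pretraining scheme captures fine-grained semantic relations in the biomedical domain, and it achieves state-of-the-art results on several medical entity linking benchmark datasets.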

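🚀 Quick start

Model information

SapBERT was proposed by Liu et al. (2020). This model is trained on UMLS 2020AA (English only) and uses microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as its base model. Use the mean pooling of the output (the average of the last hidden states over the token dimension) as the representation.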
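
To make "mean pooling of the output" concrete, here is a minimal NumPy sketch on a toy array standing in for the model's last hidden states. The array values and the helper names `mean_pool` and `first_token` are illustrative assumptions; the first-token variant is shown only for contrast and is not what this model expects.

```python
import numpy as np

def mean_pool(token_embs: np.ndarray) -> np.ndarray:
    """Average over the token axis: (batch, tokens, hidden) -> (batch, hidden)."""
    return token_embs.mean(axis=1)

def first_token(token_embs: np.ndarray) -> np.ndarray:
    """Take the first position only (a [CLS]-style representation), for contrast."""
    return token_embs[:, 0, :]

# Toy stand-in for last hidden states: 2 names, 3 token positions, hidden size 2.
token_embs = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])

pooled = mean_pool(token_embs)   # one vector per name
first = first_token(token_embs)  # differs from the mean in general
print(pooled.shape, first.shape)
```

With the real model, the same reduction is applied to the first element of the model output, which has shape (batch, tokens, hidden).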

Model news

[News] A cross-lingual extension of SapBERT will appear in the main conference of ACL 2021!
[News] SapBERT will appear in the conference proceedings of NAACL 2021!

💻 Usage example

Basic usage

The following script converts a list of strings (entity names) into embedding vectors:

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token"
# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device).eval()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(range(0, len(all_names), bs)):
    toks = tokenizer(all_names[i:i+bs],
                     padding="max_length",
                     max_length=25,
                     truncation=True,
                     return_tensors="pt")
    # move every input tensor to the same device as the model
    toks = {k: v.to(device) for k, v in toks.items()}
    with torch.no_grad():  # inference only, no gradients needed
        # mean over the token dimension, as in the original script
        # (all 25 positions are averaged, padding included)
        mean_rep = model(**toks)[0].mean(1)
    all_embs.append(mean_rep.cpu().numpy())

# final array has one row per entity name
all_embs = np.concatenate(all_embs, axis=0)
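
Once names are embedded, entity linking can be treated as nearest-neighbour search against the embeddings of a dictionary of concept names. The sketch below is one way to do this with NumPy. It uses small hand-made vectors in place of real SapBERT embeddings so it runs without downloading the model, and the choice of cosine similarity, the helper names, and the toy dictionary are assumptions, not something this model card specifies.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def nearest_names(query_embs, dict_embs, dict_names, k=1):
    """For each query row, return the k most similar dictionary names with scores."""
    sims = l2_normalize(query_embs) @ l2_normalize(dict_embs).T
    top = np.argsort(-sims, axis=1)[:, :k]  # indices sorted by descending similarity
    return [[(dict_names[j], float(sims[i, j])) for j in row]
            for i, row in enumerate(top)]

# Toy stand-ins for embeddings (hypothetical values, 3 dimensions for readability).
dict_names = ["fever", "coronavirus infection", "oropharyngeal tumor"]
dict_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
query_embs = np.array([[0.9, 0.1, 0.0],   # closest to "fever"
                       [0.1, 0.8, 0.2]])  # closest to "coronavirus infection"

results = nearest_names(query_embs, dict_embs, dict_names, k=2)
for query_result in results:
    print(query_result)
```

In practice, `dict_embs` and `query_embs` would both come from the embedding script above, so that queries and dictionary entries live in the same representation space. For large dictionaries, an approximate nearest-neighbour index may be preferable to a full similarity matrix.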

📄 License

Citation

If you use this model, please cite the following paper:

@inproceedings{liu-etal-2021-self,
    title = "Self-Alignment Pretraining for Biomedical Entity Representations",
    author = "Liu, Fangyu  and
      Shareghi, Ehsan  and
      Meng, Zaiqiao  and
      Basaldella, Marco  and
      Collier, Nigel",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.334",
    pages = "4228--4238",
    abstract = "Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.",
}