BioLORD-STAMB2-v1 open-source model: semantic representations for clinical sentences and biomedical concepts

Biolord STAMB2 V1

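Developed by FremyCompany

BioLORD is a pre-training strategy for producing meaningful representations of clinical sentences and biomedical concepts.

Text embedding

PyTorch

Language: English · License: other · Tags: biomedical semantic embeddings, clinical term similarity, ontology representation learning

Downloads: 49

Released: October 20, 2022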

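Model overview

By grounding concept representations in definitions and in short descriptions derived from biomedical ontologies, the model produces semantic representations that follow the hierarchical structure of those ontologies more closely. It is suited to medical documents such as electronic health records (EHR) and clinical notes.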

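Model features

Semantic representations

Grounds each concept in its definition and ontology-derived descriptions, producing representations that match the hierarchical structure of biomedical ontologies

Tuned for the biomedical domain

Fine-tuned for the biomedical domain, so it handles clinical documents and medical terminology well

Sentences and concepts in one model

Computes similarity for both clinical sentences and biomedical concepts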

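Model capabilities

Sentence similarity

Biomedical concept representations

Feature extraction from clinical documents

Text clustering

Semantic search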

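Use cases

Clinical medicine

Medical term matching

Identifies terms that are worded differently but refer to the same medical concept

State-of-the-art results on the MayoSRS dataset at the time of publication

Electronic health record analysis

Extracts and links related medical concepts from clinical notes

Biomedical research

Biomedical ontology alignment

Helps integrate biomedical ontology data from different sources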

🚀 FremyCompany/BioLORD-STAMB2-v1

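This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. At the time of publication, it set a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).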

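⚠️ Important note

This model was introduced in 2022, and newer versions have been released since then. For most use cases, BioLORD-2023, our latest generation of BioLORD models, is the better choice.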

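State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations.

BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).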
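
To make the contrastive idea concrete, here is a minimal NumPy sketch of a generic InfoNCE-style in-batch loss that pulls each concept name toward its own definition and pushes it away from the other definitions in the batch. This is an illustration only: the function name `info_nce_loss`, the temperature value, and the random toy vectors are assumptions, and it is not claimed to be the exact training objective used for BioLORD.

```python
import numpy as np

def info_nce_loss(name_emb, desc_emb, temperature=0.05):
    """Generic in-batch contrastive loss: row i of name_emb should match row i of desc_emb."""
    # L2-normalize so that dot products are cosine similarities
    a = name_emb / np.linalg.norm(name_emb, axis=1, keepdims=True)
    b = desc_emb / np.linalg.norm(desc_emb, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature                      # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching description for name i sits on the diagonal
    return float(-np.mean(np.diag(log_probs)))

# Toy data (hypothetical): random vectors stand in for real embeddings
rng = np.random.default_rng(0)
names = rng.normal(size=(4, 8))
aligned = names + 0.01 * rng.normal(size=(4, 8))   # each description close to its name
shuffled = aligned[[1, 2, 3, 0]]                   # descriptions paired with the wrong names

loss_aligned = info_nce_loss(names, aligned)
loss_shuffled = info_nce_loss(names, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is small when every name is closest to its own description and grows when the pairing is wrong, which is the pressure that keeps representations from collapsing to a single point.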

This model is based on sentence-transformers/all-mpnet-base-v2 and was further fine-tuned on the BioLORD-Dataset.

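✨ Key features

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
The model has been fine-tuned for the biomedical domain. It remains usable for general-purpose text embeddings, but is most useful on medical documents such as EHR records or clinical notes.
Both sentences and phrases can be embedded in the same latent space.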
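
Semantic search over such a vector space usually comes down to ranking stored vectors by cosine similarity to a query vector. The sketch below shows that ranking step with NumPy; the helper name `top_k` and the 3-dimensional toy vectors are assumptions for illustration, standing in for the 768-dimensional embeddings the model would produce.

```python
import numpy as np

def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, cosine similarity) pairs for the k corpus rows closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                         # cosine similarity of every row to the query
    order = np.argsort(-scores)[:k]        # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors (hypothetical), not real model output
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])

hits = top_k(query, corpus, k=2)
print(hits)
```

In practice `corpus` would hold the encoded documents or concept names and `query` the encoded search phrase; for large collections an approximate nearest-neighbor index would typically replace the brute-force matrix product.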

📦 Installation

To use this model, install sentence-transformers:

pip install -U sentence-transformers

💻 Usage examples

Basic usage

from sentence_transformers import SentenceTransformer
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences)
print(embeddings)
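
Raw embedding vectors are rarely inspected directly; a common next step is to compare them pairwise with cosine similarity. The following is a minimal sketch with NumPy. The helper name `cosine_similarity_matrix` is an assumption, and the toy vectors only demonstrate the shape of the result, not how the model scores the three example terms.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

# With the model loaded as in the basic usage example, one would call:
#   sims = cosine_similarity_matrix(model.encode(sentences))
# The toy vectors below (hypothetical) only show what the output looks like.
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

Entry `sims[i, j]` is the similarity between input `i` and input `j`; the diagonal is always 1 because every vector is identical to itself.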

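Advanced usage

Without sentence-transformers, the model can be used as follows: first pass the input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.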

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)
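
The reason the pooling step takes the attention mask into account is that padded positions would otherwise distort the average. A small NumPy sketch with made-up numbers (not model output) shows the effect; the function name `masked_mean` is an assumption and mirrors the logic of `mean_pooling` above.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors per sentence, counting only positions where the mask is 1."""
    mask = attention_mask[..., None].astype(float)     # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # avoid division by zero
    return summed / counts

# One sentence with three token slots; the last slot is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padded vector `[100, 100]` is ignored, so the result is the mean of the two real tokens only. A plain mean over all three slots would have been pulled far away from them.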

📚 Documentation

Model information

Property	Details
Model type	Biomedical-domain model fine-tuned from sentence-transformers/all-mpnet-base-v2
Training data	BioLORD-Dataset

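Citation

This model accompanies the paper BioLORD: Learning Ontological Representations from Definitions, which was accepted to the Findings of EMNLP 2022. When you use this model, please cite the original paper as follows: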

@inproceedings{remy-etal-2022-biolord,
    title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
    author = "Remy, François  and
      Demuynck, Kris  and
      Demeester, Thomas",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.104",
    pages = "1454--1465",
    abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}

You may also want to take a look at our MWE 2023 paper: