stanford-deidentifier: a free, open-source system for automated, high-accuracy deidentification of radiology reports

Stanford Deidentifier Only Radiology Reports Augmented

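Developed by StanfordAIMI

An automated deidentification system for radiology reports that combines a transformer model with rule-based methods to detect and replace PHI with high accuracy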

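Sequence labeling

Transformers

Language: English. License: MIT. Tags: #radiology report deidentification, #automatic PHI detection, #biomedical NLP

Downloads: 30

Released: 6/9/2022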

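Model overview

An automated deidentification model designed for radiology and biomedical documents. It detects protected health information (PHI) entities and replaces them with safe surrogate values to support HIPAA privacy requirements.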

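Model highlights

Strong performance across institutions

Achieves an F1 score of 97.9 on radiology reports from a known institution and 99.6 on reports from a new institution. According to the paper, it also outperforms human labelers on i2b2 2014 data.

Multi-domain coverage

The training data comprises 6193 multi-institutional, cross-domain documents, covering chest X-ray reports, CT reports, and general medical notes.

Hybrid design

Combines a PubMedBERT transformer model with "hide in plain sight" rule-based methods to detect PHI and replace it with realistic surrogates.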
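The "hide in plain sight" step described above can be illustrated with a minimal sketch: once PHI spans are known, each one is swapped for a realistic-looking surrogate so that any PHI the detector missed is harder to tell apart from the replacements. The surrogate pools, the label names (`NAME`, `PHONE`), and the function `hide_in_plain_sight` are all hypothetical and chosen for illustration only; the project's actual rules may differ considerably.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical surrogate pools; a real system would use much larger,
# label-specific generators (names, dates, phone numbers, and so on).
SURROGATES: Dict[str, List[str]] = {
    "NAME": ["Morgan", "Rivera", "Chen"],
    "PHONE": ["555-010-0199"],
}

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def hide_in_plain_sight(text: str, spans: List[Span], rng: random.Random) -> str:
    """Replace each span with a surrogate; assumes spans do not overlap."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pool = SURROGATES.get(label)
        # Fall back to a visible placeholder when no surrogate pool exists.
        pieces.append(rng.choice(pool) if pool else f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)
```

Passing a seeded `random.Random` keeps the output reproducible, which is convenient when comparing deidentified versions of the same report.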

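Model capabilities

PHI detection in radiology reports

Deidentification of biomedical text

Automatic replacement of sensitive information

Processing of documents from multiple institutions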

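Use cases

Medical privacy protection

Chest X-ray report deidentification

Automatically detects and replaces patient information, physician names, and institution information in chest X-ray reports

Achieves 99.1 recall in detecting the core of each PHI span on reports from a known institution

Cross-institution data sharing

Processes radiology reports from different healthcare institutions and produces consistently deidentified output

Achieves an F1 score of 99.6 on data from a new institution

Research data preparation

Clinical research data deidentification

Prepares radiology datasets that meet privacy requirements for medical research

Supports building research datasets aligned with HIPAA requirements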

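🚀 Stanford de-identifier

The Stanford de-identifier was trained on a variety of radiology and biomedical documents, with the goal of automating the deidentification process while reaching an accuracy sufficient for production use. The associated paper was published in the Journal of the American Medical Informatics Association (2022); see the citation below.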

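🚀 Quick start

The Stanford de-identifier can be used to automate the deidentification of radiology and biomedical documents. Associated GitHub repository: https://github.com/MIDRC/Stanford_Penn_Deidentifier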
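As a rough starting point, the model can presumably be loaded through the Hugging Face Transformers `pipeline` API for token classification. The sketch below makes several assumptions: the Hub id in `MODEL_ID` is inferred from the model name and developer shown on this page and should be verified, and the helper names `load_phi_detector` and `mask_phi` are hypothetical. It only masks detected spans with their label; the full pipeline in the GitHub repository also performs surrogate replacement.

```python
from typing import Dict, List

# Assumption: Hub id inferred from the model name on this page; verify before use.
MODEL_ID = "StanfordAIMI/stanford-deidentifier-only-radiology-reports-augmented"


def load_phi_detector(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads weights on first call)."""
    # Imported lazily so mask_phi can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" groups word pieces into entity spans,
    # each with "entity_group", "start" and "end" keys.
    return pipeline(
        "token-classification", model=model_id, aggregation_strategy="simple"
    )


def mask_phi(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"] :]
    return text


# Typical use (requires network access and the transformers package):
#   detector = load_phi_detector()
#   report = "We received the confirmation of Dr. Perez."
#   print(mask_phi(report, detector(report)))
```

The label names that appear in the placeholders come from the model's own configuration, so they are not listed here.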

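✨ Key features

Support for multiple document types: handles a variety of radiology and biomedical documents.
Automated deidentification: automatically detects and processes sensitive information in documents.
High accuracy: the paper reports F1 scores of 97.9 on radiology reports from a known institution, 99.6 from a new institution, 99.5 on i2b2 2006, and 98.9 on i2b2 2014.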
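For readers who want to reproduce this kind of measurement on their own annotations, the following is a minimal token-level sketch of precision, recall, and F1. It assumes one label per token and an outside label `"O"`; the function name `phi_prf` and the label names are hypothetical, and the paper's exact evaluation protocol may differ from this simplification.

```python
from typing import Sequence, Tuple


def phi_prf(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 over PHI (non-outside) labels."""
    pairs = list(zip(gold, pred))
    # True positive: a PHI token predicted with the correct label.
    tp = sum(1 for g, p in pairs if g != outside and g == p)
    # False positive: a PHI prediction that does not match the gold label.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a gold PHI token that was missed or mislabeled.
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For deidentification, recall is usually the metric to watch most closely, since every missed PHI token is a potential privacy leak.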

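📚 Documentation

Sample text

PROCEDURE: Chest X-ray. COMPARISON: Last seen on January 1, 2020, with an additional record dated March 1, 2019. FINDINGS: Patchy lung opacities. IMPRESSION: The results of the chest X-ray of January 1, 2020 are the most concerning. The patient was transferred to another service at UH Medical Center under the care of Dr. Perez. We sent the data on February 1, 2020 using the MedClinical data transmission system, under ID 5874233. We received confirmation from Dr. Perez. He can be reached at 567-493-1234.

Dr. Curt Langlotz chose to schedule a meeting on June 23.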

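Tags

Property	Details
Model type	Token classification, sequence labeling model
Training data	radreports dataset
Framework	PyTorch, Transformers
Pretrained model	PubMedBERT (uncased)
Application domains	Radiology, biomedicine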
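Because the model is a token classifier, its per-token labels have to be grouped into contiguous spans before anything can be replaced. The sketch below shows one simple way to do that, assuming each token arrives as a `(start, end, label)` triple and that `"O"` marks non-PHI tokens. The function name `merge_token_labels` and the labels `NAME` and `DATE` are hypothetical and are not taken from the model's configuration.

```python
from typing import List, Tuple

Token = Tuple[int, int, str]  # (start offset, end offset, label)


def merge_token_labels(tokens: List[Token], outside: str = "O") -> List[Token]:
    """Merge runs of consecutive tokens that share a PHI label into spans."""
    spans: List[Token] = []
    prev_label = outside
    for start, end, label in tokens:
        if label != outside:
            if label == prev_label:
                # Same label as the previous token: extend the open span.
                spans[-1] = (spans[-1][0], end, label)
            else:
                # A different PHI label (or the first one): open a new span.
                spans.append((start, end, label))
        prev_label = label
    return spans
```

An outside token between two tokens with the same label closes the first span, so two separate names in one sentence stay separate.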

📄 License

This project is released under the MIT License.

📚 Citation

If you use this project, please cite the following paper:

@article{10.1093/jamia/ocac219,
    author = {Chambon, Pierre J and Wu, Christopher and Steinkamp, Jackson M and Adleberg, Jason and Cook, Tessa S and Langlotz, Curtis P},
    title = "{Automated deidentification of radiology reports combining transformer and “hide in plain sight” rule-based methods}",
    journal = {Journal of the American Medical Informatics Association},
    year = {2022},
    month = {11},
    abstract = "{To develop an automated deidentification pipeline for radiology reports that detect protected health information (PHI) entities and replaces them with realistic surrogates “hiding in plain sight.” In this retrospective study, 999 chest X-ray and CT reports collected between November 2019 and November 2020 were annotated for PHI at the token level and combined with 3001 X-rays and 2193 medical notes previously labeled, forming a large multi-institutional and cross-domain dataset of 6193 documents. Two radiology test sets, from a known and a new institution, as well as i2b2 2006 and 2014 test sets, served as an evaluation set to estimate model performance and to compare it with previously released deidentification tools. Several PHI detection models were developed based on different training datasets, fine-tuning approaches and data augmentation techniques, and a synthetic PHI generation algorithm. These models were compared using metrics such as precision, recall and F1 score, as well as paired samples Wilcoxon tests. Our best PHI detection model achieves 97.9 F1 score on radiology reports from a known institution, 99.6 from a new institution, 99.5 on i2b2 2006, and 98.9 on i2b2 2014. On reports from a known institution, it achieves 99.1 recall of detecting the core of each PHI span. Our model outperforms all deidentifiers it was compared to on all test sets as well as human labelers on i2b2 2014 data. It enables accurate and automatic deidentification of radiology reports. A transformer-based deidentification pipeline can achieve state-of-the-art performance for deidentifying radiology reports and other medical documents.}",
    issn = {1527-974X},
    doi = {10.1093/jamia/ocac219},
    url = {https://doi.org/10.1093/jamia/ocac219},
    note = {ocac219},
    eprint = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocac219/47220191/ocac219.pdf},
}