🚀 CamemBERT-bio: a Tasty French Language Model Better for your Health
CamemBERT-bio is a state-of-the-art French biomedical language model built through continual pre-training of camembert-base. It was trained on a public French biomedical corpus of 413 million words containing scientific documents, drug leaflets, and clinical cases extracted from theses and articles. Compared with camembert-base, it improves the average F1 score by 2.54 points across 5 different biomedical named entity recognition tasks.
✨ Key Features
- Domain-specific optimization: designed for the French biomedical domain, with strong performance on biomedical named entity recognition tasks and a clear improvement over the base model.
- Rich training corpus: trained on a large public French biomedical corpus covering scientific documents, drug leaflets, and clinical cases.
📦 Model Information

| Attribute | Details |
|---|---|
| Model type | French biomedical language model built via continual pre-training |
| Training data | A public French biomedical corpus of 413 million words containing scientific documents, drug leaflets, and clinical cases extracted from theses and articles |
🔧 Technical Details

Training data

| Corpus | Details | Size (words) |
|---|---|---|
| ISTEX | Diverse scientific literature indexed on ISTEX | 276M |
| CLEAR | Drug leaflets | 73M |
| E3C | Various documents from journals, drug leaflets, and clinical cases | 64M |
| Total | | 413M |
Training procedure

We continually pre-trained from camembert-base. The model was trained with a Masked Language Modeling (MLM) objective using whole word masking, for 50,000 steps over 39 hours on 2 Tesla V100 GPUs.
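The whole-word-masking selection can be sketched in plain Python. Note the `##` continuation marker below is only an illustrative convention (CamemBERT's SentencePiece tokenizer actually marks word *starts*, typically with `▁`); the point is the grouping logic: masking decisions are made per word, so all subword pieces of a chosen word are masked together.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Select whole words (not individual subword pieces) for masking.

    Continuation pieces are assumed to carry a leading "##" marker; this
    convention varies by tokenizer, but the grouping logic is the same.
    """
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)  # continuation piece joins the current word
        else:
            words.append([i])    # a new word starts here
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:       # mask every piece of the chosen word
                masked[i] = mask_token
    return masked
```

Either every piece of a word is replaced by the mask token or none is, which prevents the model from trivially reconstructing a word from its unmasked pieces.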
📚 Evaluation

Fine-tuning

For fine-tuning, we used Optuna to select the hyperparameters. The learning rate was set to 5e-5, with a warmup ratio of 0.224 and a batch size of 16. Fine-tuning ran for 2,000 steps. For prediction, a simple linear layer was added on top of the model. Notably, no CamemBERT layers were frozen during fine-tuning.
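Under these hyperparameters, the learning rate ramps up over the first 0.224 × 2,000 = 448 steps. The card does not state the decay shape after warmup; the linear decay below is an assumption, matching the common `get_linear_schedule_with_warmup` behavior in Hugging Face transformers.

```python
def linear_warmup_lr(step, total_steps=2000, warmup_ratio=0.224, peak_lr=5e-5):
    """Learning rate at `step`: linear warmup, then (assumed) linear decay.

    peak_lr (5e-5), warmup_ratio (0.224), and total_steps (2,000) come from
    the card; the linear-decay phase after warmup is an assumption.
    """
    warmup_steps = int(total_steps * warmup_ratio)  # 448 steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # ramp up from 0
    # decay linearly from peak_lr at step 448 down to 0 at step 2000
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```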
Scoring

To assess performance, we used seqeval in strict mode with the IOB2 scheme. For each evaluation, the fine-tuned model that performed best on the validation set was selected to compute the final score on the test set. For reliability, we averaged the results of 10 evaluations with different seeds.
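"Strict" scoring means a predicted entity counts as correct only if both its type and its exact span match the gold annotation. The sketch below is a minimal pure-Python reimplementation of that idea, not seqeval itself: entities start only at `B-` tags, and an `I-` tag continues an entity only when its type matches.

```python
def iob2_entities(tags):
    """Extract (type, start, end) entity spans from an IOB2 tag sequence.

    In strict IOB2 an entity begins only at a "B-" tag; a stray "I-" tag
    with no matching "B-" does not open an entity.
    """
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        if tag.startswith("I-") and etype == tag[2:]:
            continue                         # entity continues
        if etype is not None:
            entities.append((etype, start, i))
        start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return entities

def strict_f1(gold_tags, pred_tags):
    """Entity-level F1: a prediction counts only on an exact type+span match."""
    gold = set(iob2_entities(gold_tags))
    pred = set(iob2_entities(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting one of two gold entities exactly (and nothing else) gives precision 1.0, recall 0.5, and F1 ≈ 0.667; a partially overlapping span scores zero under strict matching.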
Results

| Style | Dataset | Score | CamemBERT | CamemBERT-bio |
|---|---|---|---|---|
| Clinical | CAS1 | F1 | 70.50 ± 1.75 | 73.03 ± 1.29 |
| | | P | 70.12 ± 1.93 | 71.71 ± 1.61 |
| | | R | 70.89 ± 1.78 | 74.42 ± 1.49 |
| | CAS2 | F1 | 79.02 ± 0.92 | 81.66 ± 0.59 |
| | | P | 77.3 ± 1.36 | 80.96 ± 0.91 |
| | | R | 80.83 ± 0.96 | 82.37 ± 0.69 |
| | E3C | F1 | 67.63 ± 1.45 | 69.85 ± 1.58 |
| | | P | 78.19 ± 0.72 | 79.11 ± 0.42 |
| | | R | 59.61 ± 2.25 | 62.56 ± 2.50 |
| Drug leaflets | EMEA | F1 | 74.14 ± 1.95 | 76.71 ± 1.50 |
| | | P | 74.62 ± 1.97 | 76.92 ± 1.96 |
| | | R | 73.68 ± 2.22 | 76.52 ± 1.62 |
| Scientific | MEDLINE | F1 | 65.73 ± 0.40 | 68.47 ± 0.54 |
| | | P | 64.94 ± 0.82 | 67.77 ± 0.88 |
| | | R | 66.56 ± 0.56 | 69.21 ± 1.32 |
🌱 Estimated Environmental Impact

- Hardware type: 2 × Tesla V100
- Hours used: 39 hours
- Provider: INRIA clusters
- Compute region: Paris, France
- Carbon emitted: 0.84 kg CO2 eq.

📄 License

This project is released under the MIT license.
📖 Citation
@inproceedings{touchent-de-la-clergerie-2024-camembert-bio,
title = "{C}amem{BERT}-bio: Leveraging Continual Pre-training for Cost-Effective Models on {F}rench Biomedical Data",
author = "Touchent, Rian and
de la Clergerie, {\'E}ric",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.241",
pages = "2692--2701",
abstract = "Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However these documents are unstructured and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained for plain language and are less efficient on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state-of-the-art for French biomedical models.",
}
@inproceedings{touchent:hal-04130187,
TITLE = {{CamemBERT-bio : Un mod{\`e}le de langue fran{\c c}ais savoureux et meilleur pour la sant{\'e}}},
AUTHOR = {Touchent, Rian and Romary, Laurent and De La Clergerie, Eric},
URL = {https://hal.science/hal-04130187},
BOOKTITLE = {{18e Conf{\'e}rence en Recherche d'Information et Applications \\ 16e Rencontres Jeunes Chercheurs en RI \\ 30e Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles \\ 25e Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues}},
ADDRESS = {Paris, France},
EDITOR = {Servan, Christophe and Vilnat, Anne},
PUBLISHER = {{ATALA}},
PAGES = {323-334},
YEAR = {2023},
KEYWORDS = {comptes rendus m{\'e}dicaux ; TAL clinique ; CamemBERT ; extraction d'information ; biom{\'e}dical ; reconnaissance d'entit{\'e}s nomm{\'e}es},
HAL_ID = {hal-04130187},
HAL_VERSION = {v1},
}
👥 Development Information