GENA-LM (gena-lm-bert-large-t2t)
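GENA-LM is a family of open-source foundational models for long DNA sequences.
GENA-LM models are Transformer masked language models trained on human DNA sequences.
Key differences between GENA-LM (gena-lm-bert-large-t2t) and DNABERT:
- BPE tokenization instead of k-mer tokenization;
- an input sequence length of about 4500 nucleotides (512 BPE tokens), compared to 512 nucleotides for DNABERT;
- pre-training on the T2T human genome assembly rather than GRCh38.p13.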
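The difference between the two tokenization schemes can be illustrated with a toy sketch. The code below is not the GENA-LM tokenizer: it learns a handful of BPE merges from a made-up one-sequence corpus (the real vocabulary has about 32k entries learned on genomic data), and the k-mer function assumes overlapping k-mers with stride 1. All function names are hypothetical.

```python
from collections import Counter


def kmer_tokenize(seq, k=6):
    """Split a sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for toks in seqs:
            counts.update(zip(toks, toks[1:]))
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        seqs = [merge_pair(toks, best) for toks in seqs]
    return merges


def bpe_tokenize(seq, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy corpus (an assumption for illustration only).
toy_merges = learn_bpe(["ACGTACGTACGT"], num_merges=3)
bpe_tokens = bpe_tokenize("ACGTACGT", toy_merges)
kmer_tokens = kmer_tokenize("ACGTACGT", k=6)
```

With stride-1 k-mers, the number of tokens is close to the number of nucleotides, whereas each BPE token covers a variable-length stretch. That is why 512 BPE tokens can span roughly 4500 nucleotides, about 9 per token on average.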
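Source code and data: https://github.com/AIRI-Institute/GENA_LM
Paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523
This repository also contains models fine-tuned on downstream tasks, as well as the models used in the GENA-Web genomic sequence annotation tool.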
Examples
Load the pre-trained masked language model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', trust_remote_code=True)
Load the pre-trained model to fine-tune it on a downstream classification task
Get the model class from the GENA-LM repository:
git clone https://github.com/AIRI-Institute/GENA_LM.git
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
Alternatively, download modeling_bert.py and place it next to your code.
The model class can also be obtained through HuggingFace AutoModel:
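import importlib
from transformers import AutoTokenizer, AutoModel
# Load the base model; trust_remote_code=True fetches the GENA-LM modeling code from the Hub.
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', trust_remote_code=True)
# Name of the dynamically loaded module that defines the model class.
gena_module_name = model.__class__.__module__
print(gena_module_name)
# Look up the sequence classification class in that same module.
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
# Reload the pre-trained weights into a classifier with two output labels.
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', num_labels=2)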
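Model architecture
GENA-LM (gena-lm-bert-large-t2t) is trained as a masked language model (MLM), following the approach proposed in the BigBird paper, with 15% of tokens masked. The model configuration is similar to bert-large-uncased:
- maximum sequence length of 512
- 24 layers, 16 attention heads
- hidden size of 1024
- vocabulary size of 32k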
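To give a sense of scale, the configuration above can be turned into a rough parameter estimate. This is a back-of-the-envelope sketch, not the published model size: the feed-forward width (4096, as in bert-large), the two token types, and the exact vocabulary size of 32,000 are assumptions, and the MLM head and any extra normalization layers are ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BertLikeConfig:
    vocab_size: int = 32_000        # "32k" in the card; exact value assumed
    hidden_size: int = 1024
    num_layers: int = 24
    num_heads: int = 16
    max_positions: int = 512
    intermediate_size: int = 4096   # assumption: 4 x hidden, as in bert-large
    type_vocab_size: int = 2        # assumption: standard BERT segment embeddings


def head_dim(cfg):
    """Dimension of a single attention head."""
    return cfg.hidden_size // cfg.num_heads


def encoder_params(cfg):
    """Approximate weight count of the embeddings plus the encoder stack."""
    h, f = cfg.hidden_size, cfg.intermediate_size
    # Token, position and segment embeddings, plus one layer norm (scale and bias).
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.type_vocab_size) * h + 2 * h
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (h * h + h)
    # Two dense layers of the feed-forward block, each with a bias.
    ffn = (h * f + f) + (f * h + h)
    # Two layer norms per layer.
    norms = 2 * (2 * h)
    return embeddings + cfg.num_layers * (attention + ffn + norms)


cfg = BertLikeConfig()
total = encoder_params(cfg)
print(f"head dim: {head_dim(cfg)}, encoder parameters: {total / 1e6:.0f}M")
```

Under these assumptions each head works on 64 dimensions and the encoder has roughly 336 million parameters, in line with other BERT-large-sized models.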
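The model was pre-trained on the T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000 Genomes Project SNPs (gnomAD dataset). Pre-training ran for 1,750,000 iterations with a batch size of 256 and a sequence length of 512 tokens. The Transformer was modified to use Pre-Layer normalization.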
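The augmentation step can be pictured as substituting known variants into the reference sequence before tokenization. The sketch below is an illustration under assumptions, not the authors' pipeline: the card does not say how mutations were sampled, so the uniform `rate` parameter, the `(position, ref, alt)` tuple format, and the function names are all hypothetical. A small helper also multiplies out the stated training schedule.

```python
import random


def apply_snps(sequence, snps, rate=0.5, rng=None):
    """Return a copy of `sequence` with a random subset of SNPs substituted in.

    `snps` is an iterable of (position, ref, alt) with 0-based positions.
    Each variant is applied independently with probability `rate` (assumed).
    """
    rng = rng or random.Random()
    bases = list(sequence)
    for pos, ref, alt in snps:
        # Skip variants whose reference allele does not match the sequence.
        if sequence[pos] != ref:
            continue
        if rng.random() < rate:
            bases[pos] = alt
    return "".join(bases)


def tokens_seen(iterations, batch_size, seq_len):
    """Upper bound on tokens processed during training (ignores padding)."""
    return iterations * batch_size * seq_len


reference = "ACGTACGTAC"
variants = [(1, "C", "T"), (4, "A", "G"), (7, "A", "C")]  # last one mismatches
augmented = apply_snps(reference, variants, rate=1.0, rng=random.Random(0))
budget = tokens_seen(1_750_000, 256, 512)
```

With the schedule stated above, the token budget comes to about 2.3 × 10^11 tokens, assuming every sequence fills all 512 positions.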
Evaluation
For evaluation results, see the paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523
Citation
@article{GENA_LM,
author = {Fishman, Veniamin and Kuratov, Yuri and Shmelev, Aleksei and Petrov, Maxim and Penzar, Dmitry and Shepelin, Denis and Chekanov, Nikolay and Kardymon, Olga and Burtsev, Mikhail},
title = {GENA-LM: a family of open-source foundational DNA language models for long sequences},
journal = {Nucleic Acids Research},
volume = {53},
number = {2},
pages = {gkae1310},
year = {2025},
month = {01},
issn = {0305-1048},
doi = {10.1093/nar/gkae1310},
url = {https://doi.org/10.1093/nar/gkae1310},
eprint = {https://academic.oup.com/nar/article-pdf/53/2/gkae1310/61443229/gkae1310.pdf},
}