ProtT5-XL-BFD Open-Source Protein Model - Free for Protein Feature Extraction and Downstream Fine-Tuning

Prot T5 Xl Bfd

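Developed by Rostlab

ProtT5-XL-BFD is a self-supervised pretrained model for protein sequences. It uses the T5 architecture, was trained on 2.1 billion protein sequences, and is intended for protein feature extraction and fine-tuning on downstream tasks.

Protein model

Transformers

#Protein sequence feature extraction #Self-supervised pretraining #Biophysical property modeling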

Downloads: 605

Released: 3/2/2022

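Model Overview

The model is pretrained on a large corpus of protein sequences with a masked language modeling objective. It captures biophysical properties of proteins and is suited to protein structure prediction and function analysis.

Model Features

Large-scale pretraining

Pretrained on the BFD dataset of 2.1 billion protein sequences, covering a broad range of protein diversity.

Self-supervised learning

Requires no human labels; the model learns from raw protein sequences through a masked language modeling objective.

Biophysical property capture

Features extracted from the model reflect important biophysical properties that govern protein shape.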

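Model Capabilities

Protein sequence feature extraction

Protein structure prediction

Protein function analysis

Use Cases

Bioinformatics

Protein secondary structure prediction

Predicts protein secondary structure (3-state or 8-state classification).

Reaches 77% accuracy (3-state) on the CASP12 dataset

Subcellular localization prediction

Predicts where in the cell a protein is located.

Reaches 77% accuracy on the DeepLoc dataset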

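🚀 ProtT5-XL-BFD Model

ProtT5-XL-BFD is a model pretrained on protein sequences with a masked language modeling (MLM) objective. It extracts informative features from protein sequences and is broadly applicable to protein-related downstream tasks.

🚀 Quick Start

ProtT5-XL-BFD is based on the t5-3b model and was pretrained in a self-supervised fashion on a large corpus of protein sequences. The following example uses the model in PyTorch to extract features for given protein sequences: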

from transformers import T5Tokenizer, T5Model
import re
import torch

# Load the tokenizer and the full encoder-decoder model from the Hugging Face Hub.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_bfd", do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")

# Residues must be upper-case and separated by single spaces.
sequences_example = ["A E T C Z A O", "S K T Z P"]

# Map the rare amino acids U, Z, O and B to X.
sequences_example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_example]

ids = tokenizer(sequences_example, add_special_tokens=True, padding=True)

input_ids = torch.tensor(ids["input_ids"])
attention_mask = torch.tensor(ids["attention_mask"])

with torch.no_grad():
    # T5Model needs decoder inputs; passing decoder_input_ids=None may raise an
    # error in recent transformers releases, so the encoder input ids are reused.
    embedding = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=input_ids)

# For feature extraction we recommend using the encoder embedding.
encoder_embedding = embedding.encoder_last_hidden_state.cpu().numpy()
decoder_embedding = embedding.last_hidden_state.cpu().numpy()

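✨ Key Features

Self-supervised pretraining: the model learns from a large number of protein sequences without human-labeled data, so it can draw on large amounts of publicly available data.
Feature extraction: features extracted from this self-supervised model (LM-embeddings) capture important biophysical properties that govern protein shape.
Distinct training objective: unlike the original T5 model, it is pretrained with a Bart-like MLM denoising objective.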

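📚 Detailed Documentation

Model Description

ProtT5-XL-BFD is based on the t5-3b model and was pretrained in a self-supervised fashion on a large corpus of protein sequences. This means it was pretrained on raw protein sequences only, with no human labeling of any kind (which is why it can use large amounts of publicly available data), and with an automatic process that generates inputs and labels from those sequences.

One important difference between this T5 model and the original T5 version is the denoising objective. The original T5-3B model was pretrained with a span denoising objective, whereas this model was pretrained with a Bart-like MLM denoising objective. The masking probability is consistent with the original T5 training: 15% of the amino acids in the input are masked at random.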

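Intended Uses & Limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks. On some tasks, fine-tuning the model yields higher accuracy than using it as a feature extractor. For feature extraction, the features from the encoder are recommended over those from the decoder.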
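When encoder features feed a per-protein task such as localization, the per-residue vectors usually need to be reduced to one vector per sequence. The sketch below shows one common way to do that, a mean over non-padding positions. It is an illustration under assumptions, not the authors' documented pipeline: `mean_pool` is a hypothetical helper, and it assumes NumPy arrays shaped like the encoder output and attention mask from the quick-start example.

```python
import numpy as np


def mean_pool(residue_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average per-residue vectors over real tokens only.

    residue_embeddings: (batch, seq_len, hidden) array, e.g. the encoder output.
    attention_mask:     (batch, seq_len) array, 1 for real tokens and 0 for padding.
    """
    # Broadcast the mask over the hidden dimension.
    mask = attention_mask[..., None].astype(residue_embeddings.dtype)
    # Sum only the unmasked positions of each sequence.
    summed = (residue_embeddings * mask).sum(axis=1)
    # Number of real tokens per sequence; clip to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts
```

Note that the attention mask also marks the end-of-sequence token as real, so this average includes it; slicing it off first is a design choice left to the reader.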

Training Data

ProtT5-XL-BFD was pretrained on the BFD dataset, which contains 2.1 billion protein sequences.

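Training Procedure

Preprocessing

Protein sequences are first converted to upper case and tokenized on single spaces, with a vocabulary size of 21. The rare amino acids "U, Z, O, B" are mapped to "X". The model input has the form:

Protein Sequence [EOS]

Preprocessing is performed on the fly, cutting and padding protein sequences to at most 512 tokens.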
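The preprocessing rules above can be sketched as a small helper that turns a raw sequence string into the space-separated form passed to the tokenizer. This is a minimal sketch, not the original training code: `preprocess` and `MAX_TOKENS` are hypothetical names, and reserving one of the 512 positions for the end-of-sequence token is an assumption, since the text does not say whether the limit includes it.

```python
import re

MAX_TOKENS = 512  # limit stated in the text


def preprocess(sequence: str, max_tokens: int = MAX_TOKENS) -> str:
    """Return the upper-case, space-separated form of a raw protein sequence."""
    # Remove any whitespace inside the raw string and upper-case it.
    cleaned = re.sub(r"\s+", "", sequence).upper()
    # Map the rare amino acids U, Z, O and B to X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # Assumption: keep one position free for the end-of-sequence token,
    # which the tokenizer appends itself.
    cleaned = cleaned[: max_tokens - 1]
    # One token per residue, separated by single spaces.
    return " ".join(cleaned)
```

For example, `preprocess("aetczao")` yields `"A E T C X A X"`, which matches the cleaned form of the first quick-start sequence.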

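The details of the masking procedure for each sequence are as follows:

15% of the amino acids are masked.
In 90% of the cases, the masked amino acids are replaced by the [MASK] token.
In 10% of the cases, the masked amino acids are replaced by a random amino acid different from the one they replace.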
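The masking rules above can be illustrated with a short, framework-free sketch. It is not the original training code. The names `corrupt`, `AMINO_ACIDS` and `MASK_TOKEN` are hypothetical, the 21-letter alphabet is an assumption based on the stated vocabulary size, and each position is selected independently, so 15% holds in expectation rather than exactly.

```python
import random

# Assumed alphabet: the 20 standard residues plus X for rare or unknown ones.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"
MASK_TOKEN = "[MASK]"  # placeholder spelling taken from the text


def corrupt(residues, mask_prob=0.15, mask_token_prob=0.9, rng=None):
    """Return (corrupted, picked): the noised sequence and the selected positions."""
    rng = rng or random.Random()
    corrupted, picked = [], []
    for index, residue in enumerate(residues):
        if rng.random() >= mask_prob:
            # Position not selected: the residue is kept unchanged.
            corrupted.append(residue)
            continue
        picked.append(index)
        if rng.random() < mask_token_prob:
            # 90% of selected positions become the mask token.
            corrupted.append(MASK_TOKEN)
        else:
            # 10% become a random residue that differs from the original.
            choices = [aa for aa in AMINO_ACIDS if aa != residue]
            corrupted.append(rng.choice(choices))
    return corrupted, picked
```

Passing a seeded `random.Random` as `rng` makes the corruption reproducible, which helps when inspecting individual examples.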

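Pretraining

The model was trained on a single TPU Pod V3-1024 for 1.2 million steps in total, using a sequence length of 512 (batch size 4k). It has approximately 3 billion parameters in total and was trained with an encoder-decoder architecture. The optimizer used for pretraining is AdaFactor with an inverse square root learning rate schedule.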
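For readers unfamiliar with the schedule, an inverse square root learning rate stays constant for a warm-up period and then decays in proportion to 1/sqrt(step). The sketch below shows the shape of such a schedule only. The function name `inverse_sqrt_lr` and the values `base_lr=0.01` and `warmup_steps=10_000` are placeholders chosen for illustration; the source does not state the values used for this model.

```python
import math


def inverse_sqrt_lr(step: int, base_lr: float = 0.01, warmup_steps: int = 10_000) -> float:
    """Constant at base_lr up to warmup_steps, then proportional to 1/sqrt(step)."""
    # Hypothetical defaults: base_lr and warmup_steps are illustrative only.
    effective_step = max(step, warmup_steps)
    return base_lr * math.sqrt(warmup_steps) / math.sqrt(effective_step)
```

With these placeholder values, the rate halves each time the step count quadruples after warm-up.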

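Evaluation Results

When the model is used for feature extraction, it achieves the following results (accuracy in %):

Task/Dataset	Secondary structure (3-state)	Secondary structure (8-state)	Localization	Membrane
CASP12	77	66	-	-
TS115	85	74	-	-
CB513	84	71	-	-
DeepLoc	-	-	77	91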
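The secondary-structure columns can be read as per-residue accuracies (often written Q3 and Q8), that is, the percentage of residues whose predicted state matches the observed one. The sketch below shows how such a score could be computed from two state strings. It is an illustration, not the evaluation code behind the table: both function names are hypothetical, and the 8-to-3 state reduction shown is a commonly used convention assumed here, not one stated in the source.

```python
# Assumed reduction from the 8 DSSP states to 3 states (helix, strand, other).
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def to_three_state(dssp8: str) -> str:
    """Collapse an 8-state string to H/E/C; any state not listed becomes C."""
    return "".join(DSSP8_TO_3.get(state, "C") for state in dssp8)


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

For instance, `per_residue_accuracy("HHEC", "HHCC")` gives 75.0, since three of the four positions agree.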

BibTeX Citation

@article {Elnaggar2020.07.12.199554,
	author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
	title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
	elocation-id = {2020.07.12.199554},
	year = {2020},
	doi = {10.1101/2020.07.12.199554},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: https://github.com/agemagician/ProtTrans. Competing Interest Statement: The authors have declared no competing interest.},
	URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
	eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
	journal = {bioRxiv}
}