ProstT5: Open-Source Protein Language Model for Translating Between Protein Sequence and Structure

ProstT5

Developed by Rostlab

ProstT5 is a protein language model that can translate between protein sequence and structure.

Protein model

Transformers

License: MIT #protein-sequence-structure-translation #3Di-token-embedding #remote-homology-detection

Downloads: 252.91k

Released: 7/21/2023

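Model Overview

ProstT5 (Protein structure-sequence T5) is based on ProtT5-XL-U50 and fine-tuned to translate in both directions between protein sequence and 3D structure, where structure is represented as a string of 3Di tokens. It supports predicting 3Di structure strings from amino acid sequences (folding) and generating amino acid sequences from 3Di structure strings (inverse folding).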

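Model Features

Bidirectional translation

Translates in both directions between protein sequence (AA) and structure (3Di): folding (AA→3Di) and inverse folding (3Di→AA).

Fine-tuned from ProtT5-XL-U50

Fine-tuned on 17 million proteins with high-quality 3D structure predictions, building on the representations learned by ProtT5-XL-U50.

Structure feature extraction

Extracts features from 3D structures represented as 3Di tokens, which sequence-only protein language models cannot do.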

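Model Capabilities

Protein sequence-to-structure translation

Protein structure-to-sequence translation

Protein sequence feature extraction

Protein structure feature extraction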

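Use Cases

Bioinformatics

Remote homology detection

Predicted 3Di strings can be used with Foldseek for remote homology detection without explicitly computing 3D structures.

Protein design

Inverse folding generates candidate amino acid sequences from a 3D structure, which can support protein design.

Computational biology

Protein structure prediction

Predicts a simplified representation of 3D structure (3Di tokens) from an amino acid sequence.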

🚀 ProstT5 Model Card

ProstT5 is a protein language model (pLM) that can translate between protein sequence and structure.

🚀 Quick Start

Feature extraction

from transformers import T5Tokenizer, T5EncoderModel
import re
import torch
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer (it is not moved to a device; only models and tensors are)
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model = model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list. Amino acid sequences are expected to be upper-case ("PRTEINO" below) while 3Di-sequences need to be lower-case ("strct" below).
sequence_examples = ["PRTEINO", "strct"]

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly (this already expects 3Di-sequences to be lower-case)
# if you go from AAs to 3Di (or if you want to embed AAs), you need to prepend "<AA2fold>"
# if you go from 3Di to AAs (or if you want to embed 3Di), you need to prepend "<fold2AA>"
sequence_examples = [ "<AA2fold>" + " " + s if s.isupper() else "<fold2AA>" + " " + s
                      for s in sequence_examples
                    ]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest", return_tensors='pt').to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(
              ids.input_ids, 
              attention_mask=ids.attention_mask
              )

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens, incl. prefix ([0,1:8]) 
emb_0 = embedding_repr.last_hidden_state[0,1:8] # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,1:6])
emb_1 = embedding_repr.last_hidden_state[1,1:6] # shape (5 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
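
The slice bounds in the example above follow from the token layout: one direction-prefix token, then one token per residue, then the end-of-sequence token and any padding. As a minimal sketch, the bounds can be derived from the raw sequence length instead of being hard-coded. The helper name `residue_slice` is hypothetical and not part of ProstT5 or transformers, and the stand-in list below only mimics the token positions.

```python
def residue_slice(seq_len: int) -> slice:
    """Return the slice selecting per-residue rows of one embedded sequence.

    Assumed layout (taken from the example above): index 0 holds the
    direction prefix, indices 1..seq_len hold the residues, and everything
    after that is the end-of-sequence token or padding.
    """
    if seq_len < 1:
        raise ValueError("sequence must contain at least one residue")
    return slice(1, seq_len + 1)


# Raw (un-spaced, un-prefixed) inputs from the feature-extraction example.
raw_sequences = ["PRTEINO", "strct"]
slices = [residue_slice(len(s)) for s in raw_sequences]

# Stand-in for one row of token positions: prefix, residues, EOS, padding.
positions = ["<prefix>"] + list("PRTEINO") + ["</s>", "<pad>"]
residues_only = positions[slices[0]]
```

With the tensors from the example, indexing `last_hidden_state[i, residue_slice(len(raw_sequences[i]))]` should select the same rows as the hard-coded `[0,1:8]` and `[1,1:6]`.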

Translation ("folding", i.e., from amino acids to 3Di)

from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
import re
import torch
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer (it is not moved to a device; only models and tensors are)
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model = model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list.
# Amino acid sequences are expected to be upper-case ("PRTEINO" below)
# while 3Di-sequences need to be lower-case.
sequence_examples = ["PRTEINO", "SEQWENCE"]
min_len = min([ len(s) for s in sequence_examples])
max_len = max([ len(s) for s in sequence_examples])

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly. For the translation from AAs to 3Di, you need to prepend "<AA2fold>"
sequence_examples = [ "<AA2fold>" + " " + s for s in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples,
                                  add_special_tokens=True,
                                  padding="longest",
                                  return_tensors='pt').to(device)

# Generation configuration for "folding" (AA-->3Di)
gen_kwargs_aa2fold = {
                  "do_sample": True,
                  "num_beams": 3, 
                  "top_p" : 0.95, 
                  "temperature" : 1.2, 
                  "top_k" : 6,
                  "repetition_penalty" : 1.2,
}

# translate from AA to 3Di (AA-->3Di)
with torch.no_grad():
  translations = model.generate( 
              ids.input_ids, 
              attention_mask=ids.attention_mask, 
              max_length=max_len, # max length of generated text
              min_length=min_len, # minimum length of the generated text
              early_stopping=True, # stop early if end-of-text token is generated
              num_return_sequences=1, # return only a single sequence
              **gen_kwargs_aa2fold
  )
# Decode and remove white-spaces between tokens
decoded_translations = tokenizer.batch_decode( translations, skip_special_tokens=True )
structure_sequences = [ "".join(ts.split(" ")) for ts in decoded_translations ] # predicted 3Di strings

# Now we can use the same model and invert the translation logic
# to generate an amino acid sequence from the predicted 3Di-sequence (3Di-->AA)

# add pre-fixes accordingly. For the translation from 3Di to AA (3Di-->AA), you need to prepend "<fold2AA>"
sequence_examples_backtranslation = [ "<fold2AA>" + " " + s for s in decoded_translations]

# tokenize sequences and pad up to the longest sequence in the batch
ids_backtranslation = tokenizer.batch_encode_plus(sequence_examples_backtranslation,
                                  add_special_tokens=True,
                                  padding="longest",
                                  return_tensors='pt').to(device)

# Example generation configuration for "inverse folding" (3Di-->AA)
gen_kwargs_fold2AA = {
            "do_sample": True,
            "top_p" : 0.90,
            "temperature" : 1.1,
            "top_k" : 6,
            "repetition_penalty" : 1.2,
}

# translate from 3Di to AA (3Di-->AA)
with torch.no_grad():
  backtranslations = model.generate( 
              ids_backtranslation.input_ids, 
              attention_mask=ids_backtranslation.attention_mask, 
              max_length=max_len, # max length of generated text
              min_length=min_len, # minimum length of the generated text
              early_stopping=True, # stop early if end-of-text token is generated
              num_return_sequences=1, # return only a single sequence
              **gen_kwargs_fold2AA
  )
# Decode and remove white-spaces between tokens
decoded_backtranslations = tokenizer.batch_decode( backtranslations, skip_special_tokens=True )
aminoAcid_sequences = [ "".join(ts.split(" ")) for ts in decoded_backtranslations ] # predicted amino acid strings
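
Because generation is sampled and the length limits above are set per batch rather than per sequence, a predicted 3Di string is not guaranteed to line up with its input. The following is a small sanity check, sketched under two assumptions: that a 3Di string should carry one state per residue, and that the 3Di alphabet is the 20 standard amino-acid letters in lower-case. The function name `check_3di` is hypothetical.

```python
from typing import List

# Assumption: 3Di reuses the 20 standard amino-acid letters, lower-cased so
# they do not collide with amino-acid tokens.
THREE_DI_ALPHABET = frozenset("acdefghiklmnpqrstvwy")


def check_3di(aa_seq: str, three_di: str) -> List[str]:
    """Return a list of problems found in a predicted 3Di string (empty if none)."""
    problems = []
    if len(three_di) != len(aa_seq):
        problems.append(
            f"length mismatch: {len(aa_seq)} residues vs {len(three_di)} 3Di states"
        )
    unexpected = sorted(set(three_di) - THREE_DI_ALPHABET)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    return problems
```

For example, `check_3di("PRTEINO", structure_sequences[0])` would flag a prediction that is shorter or longer than the seven-residue input.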

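✨ Key Features

Cross-modal translation: translates between protein sequence and structure in both directions.
Feature extraction: supports conventional feature extraction and, unlike the base model, can also embed 3D structures represented as 3Di tokens.
Folding and inverse folding: supports "folding" (sequence to structure) and "inverse folding" (structure to sequence).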

📚 Documentation

Model Details

Model Description

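ProstT5 (Protein structure-sequence T5) is based on ProtT5-XL-U50, a T5 model trained to encode protein sequences by applying span corruption to billions of protein sequences. ProstT5 fine-tunes ProtT5-XL-U50 on 17 million proteins with high-quality 3D structure predictions from AlphaFoldDB so that it can translate between protein sequence and structure. Protein structures are converted from 3D to 1D using the 3Di tokens introduced by Foldseek.

In a first step, ProstT5 learns to represent the newly introduced 3Di tokens by continuing the original span-denoising objective on both 3Di and amino acid (AA) sequences. Only in a second step is ProstT5 trained to translate between the two modalities. The direction of translation is indicated by two special tokens ("<fold2AA>" for translating from 3Di to AA, "<AA2fold>" for translating from AA to 3Di). To avoid clashes with AA tokens, 3Di tokens are cast to lower-case (the alphabets are otherwise identical).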
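
The prefix and casing conventions can be wrapped in a small helper. This is a minimal sketch that mirrors the preprocessing in the quick-start code (prefix strings `<AA2fold>` and `<fold2AA>`, the `[UZOB]` replacement, one space between residues); the function name `build_model_input` and the decision to reject mixed-case input are assumptions made for illustration.

```python
import re

AA2FOLD = "<AA2fold>"  # amino acids -> 3Di, also used when embedding amino acids
FOLD2AA = "<fold2AA>"  # 3Di -> amino acids, also used when embedding 3Di


def build_model_input(sequence: str) -> str:
    """Format one raw sequence the way the quick-start examples expect.

    Upper-case input is treated as amino acids, lower-case input as 3Di.
    """
    if sequence.isupper():
        # Map rare/ambiguous amino acids to X, as in the examples.
        sequence = re.sub(r"[UZOB]", "X", sequence)
        prefix = AA2FOLD
    elif sequence.islower():
        prefix = FOLD2AA
    else:
        raise ValueError("mixed-case input: cannot tell amino acids from 3Di")
    # One token per residue: separate characters with single spaces.
    return prefix + " " + " ".join(sequence)
```

Calling `build_model_input` on each raw string yields a list that can be passed directly to the tokenizer.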

Developed by: Michael Heinzinger (GitHub @mheinzinger; Twitter @HeinzingerM)
Model type: Encoder-decoder (T5)
Language(s) (NLP): Protein sequence and structure
License: MIT
Finetuned from model: ProtT5-XL-U50

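Uses

Feature extraction: the model can be used for conventional feature extraction. For this, we recommend using only the encoder, in half-precision (fp16), together with batching. Examples (currently only for the original ProtT5-XL-U50, but they work after replacing the repository link and adding the prefix): script and Colab. In contrast to the original ProtT5-XL-U50, which can only embed AA sequences, ProstT5 can also embed 3D structures represented as 3Di tokens. 3Di tokens can either be derived from 3D structures using Foldseek or be predicted from AA sequences by ProstT5.
"Folding": translation from sequence (AA) to structure (3Di). The resulting 3Di strings can be used together with Foldseek for remote homology detection while avoiding the explicit computation of 3D structures.
"Inverse folding": translation from structure (3Di) to sequence (AA).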
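
Predicted 3Di strings usually need to be written to disk before another tool can use them. The following is a minimal sketch that serializes identifier/sequence pairs as FASTA-formatted text. The function name, the identifiers and the line width are illustrative assumptions, and how Foldseek expects to receive predicted 3Di strings is not covered here; consult the Foldseek documentation for that step.

```python
from typing import Iterable, Tuple


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Serialize (identifier, sequence) pairs as FASTA text.

    Each record becomes a '>' header line followed by the sequence,
    wrapped at `width` characters per line.
    """
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


# Hypothetical identifiers paired with made-up 3Di strings.
example = to_fasta([("protein_1", "dvvvvpp"), ("protein_2", "dpvllvvv")])
```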

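Training Details

Training Data

Pre-training data (3Di + AA sequences for 17 million proteins)

Training Procedure

The first phase of pre-training continues span-based denoising on 3Di and AA sequences using this script. For the second phase of pre-training (the actual translation from 3Di to AA sequences and vice versa), we used this script.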

Training Hyperparameters

Training regime: we used DeepSpeed (stage 2), gradient accumulation (5 steps), mixed half-precision (bf16) and the PyTorch 2.0 torchInductor compiler.
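
The DeepSpeed settings named above map onto standard DeepSpeed configuration keys. The sketch below is an illustration, not the authors' actual configuration: the micro-batch size and GPU count are made-up placeholders, and the torchInductor compilation step is not represented.

```python
# Illustrative DeepSpeed-style configuration as a Python dict.
ds_config = {
    "zero_optimization": {"stage": 2},     # DeepSpeed stage 2
    "gradient_accumulation_steps": 5,      # 5 accumulation steps
    "bf16": {"enabled": True},             # mixed half-precision (bf16)
    "train_micro_batch_size_per_gpu": 4,   # placeholder, not from the source
}


def effective_batch_size(config: dict, num_gpus: int) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )


# With the placeholder values and a hypothetical 8-GPU node.
batch = effective_batch_size(ds_config, num_gpus=8)
```

Gradient accumulation multiplies the effective batch size without increasing per-step memory, which is the usual reason to combine it with ZeRO stage 2 on a model of this size.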

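Speed

Generating embeddings for the human proteome with the Pro(s)tT5 encoder takes around 35 minutes, or about 0.1 seconds per protein, using batching and half-precision (fp16) on a single RTX A6000 GPU with 48 GB of VRAM. Translation is comparatively slow (0.6 to 2.5 seconds per protein at average lengths of 135 and 406, respectively) because decoding is sequential and generates tokens left to right, one at a time. We used only batching and half-precision, without further optimization.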
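
The two embedding figures are consistent with each other: 35 minutes at 0.1 seconds per protein implies roughly 21,000 proteins. The sketch below makes that arithmetic explicit and applies the reported per-protein rates to a dataset of hypothetical size. It assumes runtime scales linearly with protein count, which ignores length distribution and batching effects, so treat the results as rough estimates only.

```python
def implied_protein_count(total_minutes: float, seconds_per_protein: float) -> float:
    """Number of proteins implied by a total runtime and a per-protein rate."""
    return total_minutes * 60 / seconds_per_protein


def estimated_minutes(num_proteins: int, seconds_per_protein: float) -> float:
    """Estimated runtime in minutes, assuming a constant per-protein rate."""
    return num_proteins * seconds_per_protein / 60


# Figures reported above for embedding the human proteome.
proteome_size = implied_protein_count(total_minutes=35, seconds_per_protein=0.1)

# Hypothetical dataset of 1,000 proteins at the reported translation rates.
fast_minutes = estimated_minutes(1000, 0.6)  # average length 135
slow_minutes = estimated_minutes(1000, 2.5)  # average length 406
```

Under these assumptions, translating 1,000 proteins would take on the order of 10 to 40 minutes, compared with under 2 minutes for embedding them.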