🚀 AMPLIFY
AMPLIFY is an efficient, state-of-the-art protein language model pre-trained with masked language modeling on UniRef100, OAS, and SCOP (UR100P). It can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and more. AMPLIFY comes in two sizes, 120M and 350M parameters, and the _base models are limited to sequences of at most 512 residues (stage 1). The model architecture and pre-training procedure are described below; please refer to the accompanying paper for further details.
⚠️ Important Notice
This model has been optimized with NVIDIA's TransformerEngine library. Minor numerical differences may be observed between the original model and the optimized one. For instructions on installing TransformerEngine, please refer to its official documentation.
The original xformers-based model is available at chandar-lab/AMPLIFY.
✨ Key Features
- Generates residue and protein embeddings.
- Suggests mutations (see the masked-prediction sketch after this list).
- Distinguishes disordered proteins from non-protein sequences.
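These capabilities all build on the masked-language-modeling objective the model was trained with. As a hedged illustration only (not code from the model card or paper), the sketch below masks one position of a sequence and ranks amino-acid substitutions by the predicted token probabilities; it assumes the remote-code model returns MLM scores in a `logits` attribute, that the tokenizer defines a mask token, and that a single special token is prepended to the sequence. Verify these assumptions against the checkpoint's remote code before relying on the output.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

def suggest_mutations(sequence, position, top_k=5):
    """Hypothetical helper: rank substitutions at `position` (0-based) of `sequence`."""
    ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")
    ids[0, position + 1] = tokenizer.mask_token_id  # +1 assumes one prepended special token
    with torch.no_grad():
        out = model(ids)
    probs = out.logits[0, position + 1].softmax(dim=-1)  # assumes an MLM `logits` attribute
    top = probs.topk(top_k)
    return [(tokenizer.decode([int(i)]), float(p)) for i, p in zip(top.indices, top.values)]

print(suggest_mutations("MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR", 10))
```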
📦 Installation
The documentation does not describe specific installation steps.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel
from transformers import AutoTokenizer
from datasets import load_dataset

# Load the TransformerEngine-optimized checkpoint and its tokenizer.
model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

model = model.to("cuda")

# Test split of the UR100P pre-training corpus (UniProt portion).
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the sequence and run a forward pass on the GPU.
    input = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input)
    input = input.to("cuda")
    output = model(input)
    print("Output: ", output)

    break
```
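To turn the forward pass above into embeddings, one common recipe is to take the final-layer hidden states as residue embeddings and mean-pool them into a single protein-level vector. The sketch below is a hedged illustration that continues from the example above; it assumes the remote-code model accepts `output_hidden_states=True` and exposes per-layer activations through a `hidden_states` attribute, as standard transformers models do. Check the model's remote code for the actual output fields.

```python
import torch

sequence = "MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR"  # illustrative sequence
ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")

with torch.no_grad():
    # Assumption: intermediate activations are returned when output_hidden_states=True.
    out = model(ids, output_hidden_states=True)

residue_embeddings = out.hidden_states[-1][0]        # (sequence length, hidden size)
protein_embedding = residue_embeddings.mean(dim=0)   # mean-pooled protein-level vector
print(protein_embedding.shape)                        # hidden size is 960 for AMPLIFY 350M
```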
📚 Documentation
Available Models
Model Description

| Property | AMPLIFY 120M | AMPLIFY 350M |
| --- | --- | --- |
| hidden-size | 640 | 960 |
| num-hidden-layers | 24 | 32 |
| num-attention-heads | 10 | 15 |
| intermediate-size | 2560 | 3840 |
| max-position-embeddings | 2048 | 2048 |
| vocab-size | 27 | 27 |
| rope-theta | 10000 | 10000 |
| dropout-prob | 0 | 0 |
| embedding-init-range | 0.02 | 0.02 |
| norm-eps | 1.0e-05 | 1.0e-05 |
| hidden-act | swiglu | swiglu |
| pre-activation-layer-norm | true | true |
| layer-norm-after-embedding | false | false |
| layer-norm-before-last-layer | true | true |
| rms-norm | true | true |
| ffn-bias | false | false |
| attn-bias | false | false |
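As a quick sanity check of the architecture table, the configuration shipped with each checkpoint can be inspected programmatically. This is a hedged sketch: the attribute names below follow standard transformers conventions and are assumptions; the custom AMPLIFY config class may use different field names, in which case printing the full config will reveal them.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

# Attribute names are assumptions based on standard transformers configs.
print(config.hidden_size)           # expected: 960
print(config.num_hidden_layers)     # expected: 32
print(config.num_attention_heads)   # expected: 15
print(config)                       # full configuration, including the remaining fields
```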
Training Description

| Property | Stage 1 | Stage 2 |
| --- | --- | --- |
| dataset | UR100P | UR100P |
| max-steps | 1000000 | 25000 (120M) or 50000 (350M) |
| max-length | 512 | 2048 |
| optimizer | adamw | adamw |
| lr | 0.001 | 0.0001 |
| betas | (0.9, 0.95) | (0.9, 0.95) |
| eps | 1.0e-08 | 1.0e-08 |
| weight-decay | 0.01 | 0.01 |
| scheduler | cosinedecay | none |
| warmup-steps | 1000 | none |
| final-step | 900000 | none |
| gradient-clipping | 1.0 | 1.0 |
| tf32 | true | true |
| mixed-precision | bf16 | bf16 |
| padding | max-length | max-length |
| random-truncate | true | true |
| mask-probability | 0.15 | 0.15 |
| total-batch-size | 4096 | 4096 |
| deepspeed | true | true |
| zero-stage | 3 | 3 |
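To make the stage-1 optimization settings concrete, the snippet below re-creates the listed AdamW hyperparameters and a cosine-decay schedule with 1000 warmup steps using standard PyTorch and transformers utilities. This is an illustrative sketch of the table, not the authors' training code; DeepSpeed ZeRO stage 3, bf16 mixed precision, random truncation, and the 4096-sequence global batch are intentionally omitted.

```python
import torch
from transformers import AutoModel, get_cosine_schedule_with_warmup

model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

# Stage-1 optimizer settings taken from the table above (illustrative only).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.01,
)

# Cosine decay with 1000 warmup steps, ending at the listed final step of 900000.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=900000,
)

# Gradient clipping at 1.0 would be applied before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```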
🔧 Technical Details
The documentation does not provide additional technical details.
📄 License
This project is released under the MIT License.
📚 Citation
If you find these models useful in your research, please cite the following paper:
@article{Fournier2024.09.23.614603,
title = {Protein Language Models: Is Scaling Necessary?},
author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
year = {2024},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
doi = {10.1101/2024.09.23.614603},
url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
elocation-id = {2024.09.23.614603},
eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}