🚀 AMPLIFY
AMPLIFY is an efficient, state-of-the-art protein language model pre-trained with masked language modeling on UniRef100, OAS, and SCOP (UR100P). It can generate residue and protein embeddings, suggest mutations, differentiate disordered proteins from non-protein sequences, and more. AMPLIFY comes in two sizes, 120M and 350M parameters, and the _base models are limited to sequences of at most 512 residues (stage 1). The model architecture and pre-training procedure are described below; please refer to the accompanying paper for further details.
⚠️ Important Notice
This model has been optimized with NVIDIA's TransformerEngine library. Minor numerical differences may be observed between the original model and the optimized one. For instructions on installing TransformerEngine, please refer to its official documentation.
The original xformers-based model is available at chandar-lab/AMPLIFY.
✨ Key Features
- Generates residue and protein embeddings.
- Suggests mutations (see the masked-prediction sketch after this list).
- Distinguishes disordered proteins from non-protein sequences.
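These capabilities all build on the masked-language-modeling objective the model was trained with. As a hedged illustration only (not code from the model card or paper), the sketch below masks one position of a sequence and ranks amino-acid substitutions by the predicted token probabilities; it assumes the remote-code model returns MLM scores in a `logits` attribute, that the tokenizer defines a mask token, and that a single special token is prepended to the sequence. Verify these assumptions against the checkpoint's remote code before relying on the output.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

def suggest_mutations(sequence, position, top_k=5):
    """Hypothetical helper: rank substitutions at `position` (0-based) of `sequence`."""
    ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")
    ids[0, position + 1] = tokenizer.mask_token_id  # +1 assumes one prepended special token
    with torch.no_grad():
        out = model(ids)
    probs = out.logits[0, position + 1].softmax(dim=-1)  # assumes an MLM `logits` attribute
    top = probs.topk(top_k)
    return [(tokenizer.decode([int(i)]), float(p)) for i, p in zip(top.indices, top.values)]

print(suggest_mutations("MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR", 10))
```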
📦 Installation
The documentation does not describe specific installation steps.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel
from transformers import AutoTokenizer
from datasets import load_dataset

# Load the TransformerEngine-optimized checkpoint and its tokenizer.
model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

model = model.to("cuda")

# Test split of the UR100P pre-training corpus (UniProt portion).
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the sequence and run a forward pass on the GPU.
    input = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input)
    input = input.to("cuda")
    output = model(input)
    print("Output: ", output)

    break
```
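To turn the forward pass above into embeddings, one common recipe is to take the final-layer hidden states as residue embeddings and mean-pool them into a single protein-level vector. The sketch below is a hedged illustration that continues from the example above; it assumes the remote-code model accepts `output_hidden_states=True` and exposes per-layer activations through a `hidden_states` attribute, as standard transformers models do. Check the model's remote code for the actual output fields.

```python
import torch

sequence = "MSVVGIDLGFQSCYVAVARAGGIETIANEYSDRCTPACISFGPKNR"  # illustrative sequence
ids = tokenizer.encode(sequence, return_tensors="pt").to("cuda")

with torch.no_grad():
    # Assumption: intermediate activations are returned when output_hidden_states=True.
    out = model(ids, output_hidden_states=True)

residue_embeddings = out.hidden_states[-1][0]        # (sequence length, hidden size)
protein_embedding = residue_embeddings.mean(dim=0)   # mean-pooled protein-level vector
print(protein_embedding.shape)                        # hidden size is 960 for AMPLIFY 350M
```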
📚 Documentation
Available Models
Model Description

| Property | AMPLIFY 120M | AMPLIFY 350M |
| --- | --- | --- |
| hidden-size | 640 | 960 |
| num-hidden-layers | 24 | 32 |
| num-attention-heads | 10 | 15 |
| intermediate-size | 2560 | 3840 |
| max-position-embeddings | 2048 | 2048 |
| vocab-size | 27 | 27 |
| rope-theta | 10000 | 10000 |
| dropout-prob | 0 | 0 |
| embedding-init-range | 0.02 | 0.02 |
| norm-eps | 1.0e-05 | 1.0e-05 |
| hidden-act | swiglu | swiglu |
| pre-activation-layer-norm | true | true |
| layer-norm-after-embedding | false | false |
| layer-norm-before-last-layer | true | true |
| rms-norm | true | true |
| ffn-bias | false | false |
| attn-bias | false | false |
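As a quick sanity check of the architecture table, the configuration shipped with each checkpoint can be inspected programmatically. This is a hedged sketch: the attribute names below follow standard transformers conventions and are assumptions; the custom AMPLIFY config class may use different field names, in which case printing the full config will reveal them.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

# Attribute names are assumptions based on standard transformers configs.
print(config.hidden_size)           # expected: 960
print(config.num_hidden_layers)     # expected: 32
print(config.num_attention_heads)   # expected: 15
print(config)                       # full configuration, including the remaining fields
```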
Training Description

| Property | Stage 1 | Stage 2 |
| --- | --- | --- |
| dataset | UR100P | UR100P |
| max-steps | 1000000 | 25000 (120M) or 50000 (350M) |
| max-length | 512 | 2048 |
| optimizer | adamw | adamw |
| lr | 0.001 | 0.0001 |
| betas | (0.9, 0.95) | (0.9, 0.95) |
| eps | 1.0e-08 | 1.0e-08 |
| weight-decay | 0.01 | 0.01 |
| scheduler | cosinedecay | none |
| warmup-steps | 1000 | none |
| final-step | 900000 | none |
| gradient-clipping | 1.0 | 1.0 |
| tf32 | true | true |
| mixed-precision | bf16 | bf16 |
| padding | max-length | max-length |
| random-truncate | true | true |
| mask-probability | 0.15 | 0.15 |
| total-batch-size | 4096 | 4096 |
| deepspeed | true | true |
| zero-stage | 3 | 3 |
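To make the stage-1 optimization settings concrete, the snippet below re-creates the listed AdamW hyperparameters and a cosine-decay schedule with 1000 warmup steps using standard PyTorch and transformers utilities. This is an illustrative sketch of the table, not the authors' training code; DeepSpeed ZeRO stage 3, bf16 mixed precision, random truncation, and the 4096-sequence global batch are intentionally omitted.

```python
import torch
from transformers import AutoModel, get_cosine_schedule_with_warmup

model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

# Stage-1 optimizer settings taken from the table above (illustrative only).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.01,
)

# Cosine decay with 1000 warmup steps, ending at the listed final step of 900000.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=900000,
)

# Gradient clipping at 1.0 would be applied before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```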
🔧 Technical Details
The documentation does not provide additional technical details.
📄 License
This project is released under the MIT License.
📚 Citation
If you find these models useful in your research, please cite the following paper:
@article{Fournier2024.09.23.614603,
title = {Protein Language Models: Is Scaling Necessary?},
author = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
year = {2024},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
doi = {10.1101/2024.09.23.614603},
url = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
elocation-id = {2024.09.23.614603},
eprint = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}