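SegmentNT-multi-species: an open-source genome segmentation model that predicts the locations of multiple genomic elements at single-nucleotide resolution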

Segment Nt Multi Species

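Developed by InstaDeepAI

SegmentNT-multi-species is a segmentation model built on the Nucleotide Transformer. It predicts the locations of multiple genomic elements at single-nucleotide resolution.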

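Released: 3/5/2024

Model overview

This model is obtained by fine-tuning the SegmentNT model on a dataset that covers the human genome and the genomes of five selected species (mouse, chicken, fly, zebrafish and worm). It predicts the locations of 7 main genic elements.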

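Model features

Multi-species support

Supports genome analysis for human and five other species (mouse, chicken, fly, zebrafish and worm).

High-resolution segmentation

Predicts the locations of genomic elements at single-nucleotide resolution.

Training compute

Fine-tuned for 3 days on a DGX H100 node with 8 GPUs, on a total of 8 billion tokens.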

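Model capabilities

Genomic element prediction

DNA sequence analysis

Multi-species genome segmentation

Use cases

Genomics research

Genic element localization

Predict the locations of genic elements such as exons and introns in a DNA sequence.

The model covers 7 main genic elements.

Cross-species comparison

Analyze similarities and differences in genomic elements across species.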

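🚀 Multi-species nucleotide sequence segmentation model (segment-nt-multi-species)

SegmentNT-multi-species is a segmentation model that leverages the Nucleotide Transformer (NT) DNA foundation model to predict the locations of several types of genomic elements in a sequence at single-nucleotide resolution. It is the result of fine-tuning the SegmentNT model on a dataset that includes the human genome and the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.

For the fine-tuning on the multi-species genomes, we curated a dataset containing a subset of the annotations used to train SegmentNT, mainly because only this subset of annotations is available for these species. The annotations therefore cover the 7 main genic elements available from Ensembl: protein-coding gene, 5'UTR, 3'UTR, intron, exon, splice acceptor site and splice donor site.

Developed by: InstaDeep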

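🚀 Quick start

Model sources

Repository: Nucleotide Transformer
Paper: Segmenting the genome at single-nucleotide resolution with DNA foundation models

How to use

Until the next release of the transformers library, using the model requires installing the library from source with the following command:

pip install --upgrade git+https://github.com/huggingface/transformers.git

Below is a short code snippet that retrieves logits and embeddings from a dummy DNA sequence.

⚠️ Important

By default, the maximum sequence length is set to the training length of 30,000 nucleotides, i.e. 5001 tokens (including the CLS token). However, SegmentNT has been shown to generalize to sequences of up to 50,000 bp. To run inference on sequences between 30 kbp and 50 kbp, make sure to change the rescaling_factor argument in the config to num_dna_tokens_inference / max_num_tokens_nt, where num_dna_tokens_inference is the number of tokens at inference (for example, 6669 for a sequence of 40008 bp) and max_num_tokens_nt is the maximum number of tokens on which the base Nucleotide Transformer was trained, i.e. 2048.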
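The rescaling arithmetic in the note above can be sketched as follows. This is a minimal illustration, not part of the model's API: the helper names `num_tokens_for_length` and `rescaling_factor_for_length` are hypothetical, and the sketch assumes the sequence length is a multiple of 6 with no `N` characters, so that every token is a 6-mer.

```python
# Illustrative helpers (hypothetical names, not part of transformers or the model repo).
KMER_SIZE = 6              # the tokenizer groups nucleotides into 6-mers
MAX_NUM_TOKENS_NT = 2048   # max tokens seen by the base Nucleotide Transformer in training


def num_tokens_for_length(num_nucleotides: int) -> int:
    """Token count including the prepended CLS token.

    Assumes the length is a multiple of 6 and the sequence has no N,
    so every token covers exactly 6 nucleotides.
    """
    if num_nucleotides % KMER_SIZE != 0:
        raise ValueError("this sketch only handles lengths that are multiples of 6")
    return num_nucleotides // KMER_SIZE + 1


def rescaling_factor_for_length(num_nucleotides: int) -> float:
    """rescaling_factor = num_dna_tokens_inference / max_num_tokens_nt."""
    return num_tokens_for_length(num_nucleotides) / MAX_NUM_TOKENS_NT


print(num_tokens_for_length(30_000))                  # 5001, the default maximum length
print(num_tokens_for_length(40_008))                  # 6669, the example from the note
print(round(rescaling_factor_for_length(40_008), 4))  # 3.2563
```

The resulting value is what the note says to assign to `rescaling_factor` in the model configuration. How the configuration is passed when loading the checkpoint depends on the remote code shipped with it and is not shown here.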

# Load the model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the model's maximum length is used,
# but it can be reduced, since the time needed to obtain the embeddings grows significantly with it.
# The number of DNA tokens (excluding the prepended CLS token) must be divisible by
# 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1

assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by "
    "2 to the power of the number of downsampling blocks, i.e. 4.")

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Run inference, masking out the padding tokens
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features
logits = outs.logits.detach()
# Convert them to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get the probabilities associated with introns
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:,:,idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
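Per-nucleotide probabilities are often easier to inspect as intervals. The sketch below is an illustration rather than part of the model's API: it takes a list of probabilities for one feature along one sequence and returns the half-open `(start, end)` ranges where the probability is at or above a threshold. The function name `probabilities_to_intervals`, the toy values and the 0.5 threshold are all assumptions chosen for the example.

```python
from typing import List, Tuple


def probabilities_to_intervals(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs where probs[i] >= threshold."""
    intervals = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a run begins here
        elif p < threshold and start is not None:
            intervals.append((start, i))   # the run ended at position i - 1
            start = None
    if start is not None:                  # a run that reaches the end of the sequence
        intervals.append((start, len(probs)))
    return intervals


# Toy values standing in for one sequence's per-nucleotide intron probabilities.
toy_intron_probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.6, 0.6]
print(probabilities_to_intervals(toy_intron_probs))  # [(2, 5), (6, 8)]
```

With real model output, the input would be a single 1-D slice of the intron probabilities. Which index of the last axis corresponds to "element present" should be confirmed from the printed shapes and the model repository before relying on it.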

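✨ Key features

Leverages the Nucleotide Transformer DNA foundation model to predict the locations of genomic elements at single-nucleotide resolution.
Fine-tuned on a dataset that includes the human genome and the genomes of 5 selected species, so it can be used for multi-species genome analysis.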

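📚 Detailed documentation

Training data

The segment-nt-multi-species model was fine-tuned on the human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is held out as a validation set for monitoring training and as a test set for final evaluation.

Training procedure

Preprocessing

DNA sequences are tokenized with the Nucleotide Transformer tokenizer, which splits sequences into 6-mer tokens, as described in the Tokenization section of the associated repository. The tokenizer has a vocabulary size of 4105. The model inputs then have the following form: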

<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
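The input form above can be reproduced with a simplified sketch of non-overlapping 6-mer tokenization. This is not the actual Nucleotide Transformer tokenizer: the function name `sketch_tokenize` is hypothetical, `N` characters are not handled, and emitting leftover nucleotides as single-character tokens is an assumption about how a trailing remainder is treated.

```python
from typing import List

KMER_SIZE = 6


def sketch_tokenize(sequence: str) -> List[str]:
    """Simplified illustration: CLS token, then non-overlapping 6-mers.

    Nucleotides left over after the last full 6-mer are emitted one by one
    (an assumption of this sketch; N handling is ignored entirely).
    """
    tokens = ["<CLS>"]
    full = len(sequence) - len(sequence) % KMER_SIZE   # length covered by whole 6-mers
    for i in range(0, full, KMER_SIZE):
        tokens.append(sequence[i:i + KMER_SIZE])
    tokens.extend(sequence[full:])                     # remaining single nucleotides
    return tokens


print(sketch_tokenize("ACGTGTACGTGCACGGACGACTAGTCAGCA"))
# ['<CLS>', 'ACGTGT', 'ACGTGC', 'ACGGAC', 'GACTAG', 'TCAGCA']
```

Under these assumptions, a 30-nucleotide sequence yields 5 DNA tokens plus the CLS token, which is why sequence length in tokens is roughly one sixth of the length in base pairs.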

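Training

The model was fine-tuned on a DGX H100 node with 8 GPUs, on a total of 8 billion tokens over 3 days.

Architecture

The model consists of the nucleotide-transformer-v2-500m-multi-species encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each block is made of 2 convolutional layers, with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562 million.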
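The 2 downsampling blocks are the reason the usage snippet requires the number of DNA tokens (excluding CLS) to be divisible by 2^2 = 4. A small helper can round a token count up to the next valid padded length; this is a sketch under that stated constraint, and the name `padded_max_length` is hypothetical rather than part of any library.

```python
def padded_max_length(num_dna_tokens: int, num_downsampling_blocks: int = 2) -> int:
    """Smallest valid max_length: DNA tokens rounded up to a multiple of
    2 ** num_downsampling_blocks, plus 1 for the prepended CLS token."""
    multiple = 2 ** num_downsampling_blocks
    rounded = -(-num_dna_tokens // multiple) * multiple   # ceiling to a multiple
    return rounded + 1


print(padded_max_length(9))     # 13, i.e. 12 + 1 as in the usage snippet
print(padded_max_length(5000))  # 5001, the default maximum length
```

Padding to the smallest valid length rather than the model maximum keeps inference cheaper for short inputs, in line with the comment in the usage snippet about runtime growing with length.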

BibTeX entry and citation info

@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}