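🚀 Sequence Embedding and Gene Expression Prediction Code Snippet
This code snippet obtains embeddings and gene expression predictions from given DNA, RNA, and protein sequences.
🚀 Quick Start
The following example shows how to obtain sequence embeddings and gene expression predictions: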
from transformers import AutoTokenizer, AutoModelForMaskedLM
import numpy as np
import torch

# Load the tokenizer and model; trust_remote_code=True runs the custom code shipped with the repository
tokenizer = AutoTokenizer.from_pretrained("isoformer-anonymous/Isoformer", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("isoformer-anonymous/Isoformer", trust_remote_code=True)

# Toy protein and RNA inputs built by repeating short motifs
protein_sequences = ["RSRSRSRSRSRSRSRSRSRSRL" * 9]
rna_sequences = ["ATTCCGGTTTTCA" * 9]

# Random DNA sequence of 196,608 nucleotides, seeded for reproducibility
sequence_length = 196_608
rng = np.random.default_rng(seed=0)
dna_sequences = ["".join(rng.choice(list("ATCGN"), size=(sequence_length,)))]

# Tokenize the three inputs together; the result is indexed as DNA, RNA, protein
torch_tokens = tokenizer(
    dna_input=dna_sequences, rna_input=rna_sequences, protein_input=protein_sequences
)
dna_torch_tokens = torch.tensor(torch_tokens[0]["input_ids"])
rna_torch_tokens = torch.tensor(torch_tokens[1]["input_ids"])
protein_torch_tokens = torch.tensor(torch_tokens[2]["input_ids"])

# Forward pass (inference only); the masks exclude positions whose token id is 1,
# which is presumably the padding token
with torch.no_grad():
    torch_output = model(
        tensor_dna=dna_torch_tokens,
        tensor_rna=rna_torch_tokens,
        tensor_protein=protein_torch_tokens,
        attention_mask_rna=rna_torch_tokens != 1,
        attention_mask_protein=protein_torch_tokens != 1,
    )

print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
print(f"Final DNA embedding: {torch_output['final_dna_embeddings']}")
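The `!= 1` comparisons passed as `attention_mask_rna` and `attention_mask_protein` mark every position whose token id differs from 1. The sketch below reproduces that idea on plain Python lists. It assumes that id 1 is the padding token, which the comparison implies but the source does not state; `build_attention_mask` and `pad_token_id` are illustrative names, not part of the Isoformer API.

```python
from typing import List


def build_attention_mask(
    token_ids: List[List[int]], pad_token_id: int = 1
) -> List[List[bool]]:
    """Return True for real tokens and False for padding, per sequence.

    Assumption: pad_token_id=1 mirrors the `tokens != 1` comparison in the
    example; check the tokenizer's actual padding id before relying on it.
    """
    return [[tok != pad_token_id for tok in seq] for seq in token_ids]


if __name__ == "__main__":
    # Two toy sequences padded to the same length with id 1.
    batch = [[5, 8, 9, 1, 1], [7, 6, 1, 1, 1]]
    for row in build_attention_mask(batch):
        print(row)
```

On PyTorch tensors, the elementwise comparison `tokens != pad_token_id` yields the equivalent boolean mask in a single step, which is what the example does.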
💻 Usage Examples
Basic Usage
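The basic usage follows the Quick Start example: import the model and tokenizer, prepare the DNA, RNA, and protein sequences, tokenize them, and run a forward pass through the model to obtain the gene expression predictions and DNA embeddings.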
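For the sequence-preparation step, a dependency-free way to draw a reproducible placeholder DNA string might look like the sketch below. It uses only the standard library, so it will not reproduce the exact sequence that the NumPy generator in the example yields; `random_dna` is a hypothetical helper, and the `ATCGN` alphabet and the default length are taken from the example.

```python
import random


def random_dna(length: int = 196_608, seed: int = 0, alphabet: str = "ATCGN") -> str:
    """Draw a random sequence over `alphabet`, reproducible for a given seed."""
    if length < 0:
        raise ValueError("length must be non-negative")
    rng = random.Random(seed)  # private generator, leaves global state untouched
    return "".join(rng.choices(alphabet, k=length))


if __name__ == "__main__":
    dna_sequences = [random_dna()]
    print(len(dna_sequences[0]), dna_sequences[0][:20])
```

Random sequences like this are only useful for checking that the pipeline runs end to end; meaningful predictions require real genomic input.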
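Advanced Usage
The original documentation does not include advanced usage code. For more complex scenarios, follow the basic usage flow and adjust the sequence data and the model call to your specific needs.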
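As one illustration of adjusting the sequence data, the sketch below center-crops or pads a DNA string to the 196,608-nucleotide length used in the example. This is not from the original documentation: `fit_to_length` is a hypothetical helper, and both the need for an exact input length and the use of `N` as filler are assumptions to verify against the model's own documentation.

```python
def fit_to_length(sequence: str, target_length: int = 196_608, pad_char: str = "N") -> str:
    """Center-crop or symmetrically pad `sequence` to exactly `target_length`.

    Assumptions: the target length comes from the example, and `N` (present in
    the example's alphabet) is used as filler; neither is confirmed as a model
    requirement.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if len(pad_char) != 1:
        raise ValueError("pad_char must be a single character")

    excess = len(sequence) - target_length
    if excess >= 0:
        # Too long (or exact): keep the central window.
        start = excess // 2
        return sequence[start:start + target_length]

    # Too short: split the padding between both ends, extra character on the right.
    missing = -excess
    left = missing // 2
    right = missing - left
    return pad_char * left + sequence + pad_char * right


if __name__ == "__main__":
    print(fit_to_length("ACGT", target_length=8))      # padded on both sides
    print(fit_to_length("AACCGGTT", target_length=4))  # central window kept
```

The result can be placed in `dna_sequences` in place of the random string before tokenization.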