Open-source pythia-160m-c2s model: conditional and unconditional cell generation and cell type prediction on single-cell RNA sequencing data


Pythia 160m C2s

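Developed by vandijklab

This model is the Pythia-160m language model fine-tuned on single-cell RNA sequencing data with the Cell2Sentence method. It supports conditional cell generation, unconditional cell generation, and cell type prediction.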

Molecular model

Transformers

English #single-cell RNA sequence generation #gene rank prediction #cell type classification

Downloads: 64

Release date: 2/14/2024

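Model overview

The model converts single-cell RNA sequencing data into sequences of gene names ordered by expression level (called "cell sentences"), so that a large language model can process single-cell transcriptomics data.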

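Model features

Cell2Sentence method

Converts single-cell RNA sequencing data into sequences of gene names, so that a language model can process transcriptomic data

Multi-task capability

Supports three main tasks: conditional cell generation, unconditional cell generation, and cell type prediction

Strong benchmark results

Outperforms the compared models (scGEN, scVI, scDiffusion, scGPT) on k-nearest-neighbor classification and Gromov-Wasserstein distance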

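Model capabilities

Single-cell transcriptomic data analysis

Conditional cell generation

Unconditional cell generation

Cell type prediction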

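Use cases

Biomedical research

Immune cell analysis

Generates expression profiles for specific immune cell types, based on an immune tissue dataset

Can be used to study immune cell specificity and function

Cell type identification

Predicts the type of an unknown cell from its gene expression pattern

Shows better classification performance than other methods on test data

Drug development

Virtual cell generation

Generates virtual cell expression data under specified conditions

Can be used for drug screening and effect prediction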

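🚀 Pythia-160m-c2s model

This is the Pythia-160m model developed by EleutherAI, fine-tuned on full single-cell RNA sequencing (scRNA-seq) cells using Cell2Sentence. Cell2Sentence is a new method for applying large language models to single-cell transcriptomics. We convert scRNA-seq data into sequences of gene names ordered by expression level, called "cell sentences". For more details, see the paper linked below. The model was trained on the immune tissue dataset from Domínguez et al. for about 20 hours on 8 A100 40GB GPUs, on the following tasks:

Conditional cell generation
Unconditional cell generation
Cell type prediction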

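🚀 Quick start

This model is Pythia-160m fine-tuned on full scRNA-seq cells with the Cell2Sentence method. Cell2Sentence turns single-cell transcriptomic data into "cell sentences" that a large language model can process.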
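
To make the "cell sentence" idea concrete, here is a minimal sketch of the conversion, assuming a single cell is given as a vector of expression values aligned with a list of gene names. The function name `expression_to_cell_sentence`, the toy counts, and the tie-breaking rule are illustrative assumptions and not the authors' preprocessing pipeline, which may also normalize the data first.

```python
import numpy as np


def expression_to_cell_sentence(expression, gene_names):
    """Return the genes with nonzero expression, ordered from highest to lowest.

    Hypothetical helper for illustration only; not part of the released code.
    """
    expression = np.asarray(expression, dtype=float)
    if expression.shape[0] != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")

    # Stable sort on the negated values: highest expression first,
    # ties keep their original gene order (an assumption of this sketch).
    order = np.argsort(-expression, kind="stable")

    # Keep only genes that are actually expressed in this cell.
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(ranked)


# Toy example with made-up counts for five genes.
counts = [0, 5, 2, 0, 9]
genes = ["CD3E", "MALAT1", "B2M", "CD19", "TMSB4X"]
sentence = expression_to_cell_sentence(counts, genes)
print(sentence)  # TMSB4X MALAT1 B2M
```

Dropping zero-expression genes mirrors the wording of the generation prompt used later on this page ("genes ... with nonzero expression, from highest to lowest").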

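✨ Key features

Novel method: uses Cell2Sentence to convert single-cell RNA sequencing data into sequences of gene names, so that a large language model can work with single-cell transcriptomics.
Multi-task training: trained on conditional cell generation, unconditional cell generation, and cell type prediction.
Strong benchmark results: performs well on KNN classification and Gromov-Wasserstein (GW) distance.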


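💻 Usage examples

Basic usage

Below is an example of using the model for conditional cell generation. It includes a post-processing function that removes invalid genes and merges duplicated ones. To generate a full cell, change the max_length generation parameter to 9200. For full-cell generation, an A100 GPU is recommended for inference speed and memory capacity.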

import json
import re
from collections import Counter
from typing import List

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def post_process_generated_cell_sentences(
    cell_sentence: str, 
    gene_dictionary: List
):
    """
    Post-processing function for generated cell sentences. 
    Invalid genes are removed and ranks of duplicated genes are averaged.

    Arguments:
        cell_sentence:              generated cell sentence string
        gene_dictionary:            list of gene vocabulary (all uppercase)

    Returns:
        post_processed_sentence:    generated cell sentence after post processing steps
    """
    generated_gene_names = cell_sentence.split(" ")
    generated_gene_names = [generated_gene.upper() for generated_gene in generated_gene_names]

    #--- Remove nonsense genes ---#
    generated_gene_names = [gene_name for gene_name in generated_gene_names if gene_name in gene_dictionary]

    #--- Average ranks ---#
    gene_name_to_occurrences = Counter(generated_gene_names)  # get mapping of gene name --> number of occurrences
    post_processed_sentence = generated_gene_names.copy()  # copy of generated gene list

    for gene_name in gene_name_to_occurrences:
        if gene_name_to_occurrences[gene_name] > 1:
            # Find positions of all occurrences of duplicated generated gene in list
            # Note: using post_processed_sentence here; since duplicates are being removed, list will be
            #   getting shorter. Getting indices in original list will no longer be accurate positions
            occurrence_positions = [idx for idx, elem in enumerate(post_processed_sentence) if elem == gene_name]
            average_position = int(sum(occurrence_positions) / len(occurrence_positions))

            # Remove occurrences
            post_processed_sentence = [elem for elem in post_processed_sentence if elem != gene_name]

            # Reinsert gene_name at average position
            post_processed_sentence.insert(average_position, gene_name)
    
    return post_processed_sentence

vocab_path = "pbmc_vocab.json"

with open(vocab_path, "r") as f:
    gene_dictionary = json.load(f)

model_name = "vandijklab/pythia-160m-c2s"

model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16, 
        attn_implementation="flash_attention_2"
        ).to(torch.device("cuda"))
tokenizer = AutoTokenizer.from_pretrained(model_name)

cell_type = "T Cell"
ccg = f"Enumerate the genes in a {cell_type} cell with nonzero expression, from highest to lowest."

# Prompts for other forms of generation.
# ucg = "Display a cell's genes by expression level, in descending order."
# cellsentence = "CELL_SENTENCE"
# ctp = "Identify the cell type most likely associated with these highly expressed genes listed in descending order. "
#  + cellsentence +
#  "Name the cell type connected to these genes, ranked from highest to lowest expression."

tokens = tokenizer(ccg, return_tensors='pt')
input_ids = tokens['input_ids'].to(torch.device("cuda"))
attention_mask = tokens['attention_mask'].to(torch.device("cuda"))

with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        do_sample=True,
        max_length=1024,
        top_k=50,
        top_p=0.95,
    )

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
cell_sentence = "".join(re.split(r"\?|\.|:", output_text)[1:]).strip()
processed_genes = post_process_generated_cell_sentences(cell_sentence, gene_dictionary)
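
The example above builds only the conditional generation prompt and leaves the other two as comments. As a small sketch, the three prompt templates could be wrapped in one helper. The function name `build_c2s_prompt` and the task keys are hypothetical; the prompt strings are copied from the example above, including the way the cell sentence is concatenated for cell type prediction, since the page does not state whether a separator belongs before the closing question.

```python
def build_c2s_prompt(task, cell_type=None, cell_sentence=None):
    """Build a prompt for one of the three tasks shown on this page.

    Hypothetical helper; task keys "ccg", "ucg" and "ctp" follow the
    variable names used in the example above.
    """
    if task == "ccg":  # conditional cell generation
        if not cell_type:
            raise ValueError("conditional generation needs a cell_type")
        return (
            f"Enumerate the genes in a {cell_type} cell with nonzero "
            "expression, from highest to lowest."
        )
    if task == "ucg":  # unconditional cell generation
        return "Display a cell's genes by expression level, in descending order."
    if task == "ctp":  # cell type prediction
        if not cell_sentence:
            raise ValueError("cell type prediction needs a cell_sentence")
        # Concatenated exactly as in the commented template above.
        return (
            "Identify the cell type most likely associated with these highly "
            "expressed genes listed in descending order. "
            + cell_sentence
            + "Name the cell type connected to these genes, ranked from "
            "highest to lowest expression."
        )
    raise ValueError(f"unknown task: {task!r}")


ccg_prompt = build_c2s_prompt("ccg", cell_type="T Cell")
ctp_prompt = build_c2s_prompt("ctp", cell_sentence="MALAT1 B2M TMSB4X")
```

The returned string can be passed to the tokenizer in place of `ccg` in the example above.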


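📚 Detailed documentation

Evaluation metrics

The model was evaluated on KNN classification and Gromov-Wasserstein (GW) distance. The label of each generated cell is the cell type given in its generation prompt. Real cells were sampled with replacement from a held-out test dataset. Generated cells were converted to expression vectors using the method described in the paper. See the paper for full experimental details.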

Model	k=3 NN (↑)	k=5 NN (↑)	k=10 NN (↑)	k=25 NN (↑)	GW (↓)
scGEN	0.2376	0.2330	0.2377	0.2335	315.9505
scVI	0.2436	0.2400	0.2425	0.2348	302.1285
scDiffusion	0.2335	0.2288	0.2368	0.2306	72.0208
scGPT	0.1838	0.1788	0.1811	0.1882	2989.8066
C2S (Pythia-160m)	0.2588	0.2565	0.2746	0.2715	54.3040
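
The k-NN columns can be read as a classification accuracy. The sketch below shows one plausible way to compute such a score; it is not the paper's evaluation code. It assumes that each generated cell is classified by majority vote among its k nearest real cells, that distances are Euclidean, and that cells are already expression vectors. The function name `knn_accuracy` and the toy data are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_accuracy(real_vectors, real_labels, generated_vectors, generated_labels, k=3):
    """Fraction of generated cells whose k nearest real cells vote for the
    cell type used in the generation prompt (assumed evaluation direction)."""
    real = np.asarray(real_vectors, dtype=float)
    gen = np.asarray(generated_vectors, dtype=float)

    # Pairwise Euclidean distances, shape (n_generated, n_real).
    dists = np.linalg.norm(gen[:, None, :] - real[None, :, :], axis=2)

    # Indices of the k closest real cells for every generated cell.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]

    correct = 0
    for neighbor_idx, prompt_label in zip(nearest, generated_labels):
        votes = Counter(real_labels[i] for i in neighbor_idx)
        predicted = votes.most_common(1)[0][0]
        correct += int(predicted == prompt_label)
    return correct / len(generated_labels)


# Toy 2-D "expression vectors": two well-separated cell types.
real = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
real_types = ["T cell"] * 3 + ["B cell"] * 3
generated = [[0.2, 0.1], [10.5, 10.2], [9.8, 10.1]]
prompt_types = ["T cell", "B cell", "T cell"]  # third cell misses its prompt

acc = knn_accuracy(real, real_types, generated, prompt_types, k=3)
print(round(acc, 4))  # 0.6667
```

The brute-force distance matrix is fine for a toy example; a real evaluation over thousands of cells would more likely use a dedicated nearest-neighbor library.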