TxGemma-9B-Chat open-source language model: free support for therapeutic development, available in multiple sizes

Txgemma 9b Chat

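Developed by Google

TxGemma is a family of lightweight open language models built on Gemma 2 and fine-tuned for therapeutic development. It is available in 2B, 9B, and 27B sizes.

Large language model

Transformers

Language: English | License: Other | Tags: #Drug property prediction #Therapeutic development dialogue #Multi-modality therapeutic understanding

Downloads: 4,111

Released: 3/21/2025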

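Model overview

TxGemma is a set of language models optimized for therapeutic development. The models process and interpret information about a range of therapeutic modalities and targets, including small molecules, proteins, nucleic acids, diseases, and cell lines.

Model features

Versatility

Performs strongly across a wide range of therapeutic tasks, matching or exceeding best-in-class performance on a large number of benchmarks.

Data efficiency

Remains competitive with larger models even when data is limited.

Conversational capability

The conversational variants (TxGemma-Chat) can hold natural-language dialogue and explain the reasoning behind their predictions.

Foundation for fine-tuning

Can serve as a pretrained base for specialized use cases.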

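Model capabilities

Therapeutic property prediction

Natural-language dialogue

Drug-target interaction prediction

Clinical trial approval prediction

Use cases

Drug discovery

Blood-brain barrier penetration prediction

Predicts whether a drug can cross the blood-brain barrier from its SMILES string.

Performs well on the BBB_Martins task

Target identification

Identifies potential therapeutic targets.

Therapeutic development

Drug property prediction

Predicts properties across therapeutic modalities and targets.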

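🚀 TxGemma model card

TxGemma is a collection of lightweight, state-of-the-art open language models built on Gemma 2 and fine-tuned for therapeutic development. It comes in 2B, 9B, and 27B sizes, processes and interprets information about a range of therapeutic modalities and targets, and supports tasks across therapeutic areas such as drug discovery.

🚀 Quick start

To access TxGemma on Hugging Face, you must review and agree to the Health AI Developer Foundations terms of use. Make sure you are logged in to Hugging Face, then acknowledge the license on the model page. Requests are processed immediately.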

Resources

Model documentation: TxGemma
Model on Google Cloud Model Garden: TxGemma
Model on Hugging Face: TxGemma
GitHub repository: TxGemma
Quick start notebook: notebooks/quick_start
Support: see Contact

Terms of use

Health AI Developer Foundations terms of use

Author

Google

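✨ Main features

Model description

TxGemma is a collection of lightweight, state-of-the-art open language models built on Gemma 2 and fine-tuned for therapeutic development. It comes in 2B, 9B, and 27B sizes and is designed to process and interpret information about a range of therapeutic modalities and targets, including small molecules, proteins, nucleic acids, diseases, and cell lines. TxGemma performs strongly on tasks such as property prediction, and it can serve as a foundation for further fine-tuning or as an interactive conversational agent for drug discovery. The models are fine-tuned from Gemma 2 on a diverse set of instruction-tuning datasets curated from the Therapeutics Data Commons (TDC).

Key features

Versatility: performs strongly across a wide range of therapeutic tasks, matching or exceeding best-in-class performance on a large number of benchmarks.
Data efficiency: remains competitive with larger models even when data is limited, and improves on its predecessors.
Conversational capability (TxGemma-Chat): includes conversational variants that can hold natural-language dialogue and explain the reasoning behind their predictions.
Foundation for fine-tuning: can serve as a pretrained base for specialized use cases.

Potential applications

TxGemma can be a valuable tool for researchers in the following area:

Accelerated drug discovery: streamlines therapeutic development by predicting properties of therapeutics and targets across tasks including target identification, drug-target interaction prediction, and clinical trial approval prediction.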

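📦 Installation

The model card gives no dedicated installation steps. The examples below import transformers and huggingface_hub, and their comments suggest installing accelerate and transformers with pip; see the resource links for details.

💻 Usage examples

Basic usage

Formatting prompts for therapeutic tasks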

import json
from huggingface_hub import hf_hub_download

# Load prompt template for tasks from TDC
tdc_prompts_filepath = hf_hub_download(
    repo_id="google/txgemma-9b-chat",
    filename="tdc_prompts.json",
)
with open(tdc_prompts_filepath, "r") as f:
    tdc_prompts_json = json.load(f)

# Set example TDC task and input
task_name = "BBB_Martins"
input_type = "{Drug SMILES}"
drug_smiles = "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"

# Construct prompt using template and input drug SMILES string
TDC_PROMPT = tdc_prompts_json[task_name].replace(input_type, drug_smiles)
print(TDC_PROMPT)
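
The `.replace` call above returns the template unchanged if the placeholder is misspelled, so a typo can go unnoticed. A minimal sketch of a stricter helper follows. The function name `fill_tdc_prompt` and the toy template are hypothetical and used only for illustration; the real templates in tdc_prompts.json are longer.

```python
def fill_tdc_prompt(templates: dict, task_name: str, values: dict) -> str:
    """Fill every placeholder of a TDC-style template, failing loudly on mistakes."""
    if task_name not in templates:
        raise KeyError(f"unknown task: {task_name!r}")
    prompt = templates[task_name]
    for placeholder, value in values.items():
        if placeholder not in prompt:
            raise ValueError(f"{placeholder!r} not found in template for {task_name!r}")
        prompt = prompt.replace(placeholder, value)
    return prompt


# Toy template for illustration only; not the real BBB_Martins text.
toy_templates = {"BBB_Martins": "Drug SMILES: {Drug SMILES}\nAnswer:"}
filled = fill_tdc_prompt(
    toy_templates,
    "BBB_Martins",
    {"{Drug SMILES}": "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"},
)
print(filled)
```

With the downloaded file, the same helper could be called as `fill_tdc_prompt(tdc_prompts_json, task_name, {input_type: drug_smiles})`.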

Running the model on prediction tasks

# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-9b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-9b-chat",
    device_map="auto",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Prepare tokenized inputs on the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response (the decoded text below includes the prompt)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
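
For a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, so the decoded string printed above begins with the prompt. The sketch below shows one way to keep only the completion at the string level. `strip_prompt` is a hypothetical helper, and the strings are stand-ins for a real prompt and model output. Slicing the token IDs before decoding, as the chat example later in this section does, is an alternative.

```python
def strip_prompt(full_text: str, prompt: str) -> str:
    """Return only the newly generated part of a decoded generation."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    # Decoding does not always round-trip the prompt exactly; fall back to the full text.
    return full_text.strip()


# Stand-in strings; a real run would use TDC_PROMPT and the decoded model output.
example_prompt = "Question: does the drug cross the BBB?\nAnswer:"
example_decoded = example_prompt + " (B)"
answer = strip_prompt(example_decoded, example_prompt)
print(answer)
```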

Running inference with the pipeline API

# pip install transformers
from transformers import pipeline

# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-9b-chat",
    device="cuda",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Generate response
outputs = pipe(prompt, max_new_tokens=8)
response = outputs[0]["generated_text"]
print(response)
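
With `max_new_tokens=8` the completion is a short answer. Assuming the task template offers lettered options such as "(A)" and "(B)" (an assumption; check the template text for the task in use), a small parser can turn the completion into a label. `parse_choice` is a hypothetical helper, not part of transformers. Note that the text-generation pipeline includes the prompt in `generated_text` by default, so the parser should be applied to the completion only.

```python
import re
from typing import Optional


def parse_choice(completion: str, options: str = "AB") -> Optional[str]:
    """Map a short completion such as '(B)' or 'B' to an option letter, else None."""
    text = completion.strip()
    match = re.search(r"\(([A-Z])\)", text)
    if match:
        letter = match.group(1)
    else:
        # Accept a bare letter only when it is the whole completion.
        letter = text if len(text) == 1 else ""
    return letter if letter and letter in options else None


# Stand-in completions for illustration.
for sample in [" (B)", "A", "unsure"]:
    print(repr(sample), "->", parse_choice(sample))
```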

Advanced usage

Applying the chat template for conversational use

# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-9b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-9b-chat",
    device_map="auto",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Format prompt in the conversational format
messages = [
    { "role": "user", "content": prompt}
]

# Apply the tokenizer's built-in chat template
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
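
To make the effect of `apply_chat_template` concrete, the sketch below hand-renders the Gemma 2 turn format: a `<bos>` token, then each turn wrapped in `<start_of_turn>` and `<end_of_turn>`, with the assistant role written as `model`. This is an approximation for illustration, and `render_gemma_chat` is a hypothetical function; the template bundled with the tokenizer is authoritative. Because the rendered string already starts with `<bos>`, the encoding step that follows passes `add_special_tokens=False`.

```python
def render_gemma_chat(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Approximate the Gemma 2 chat format; for illustration only."""
    parts = [bos_token]
    for index, message in enumerate(messages):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError("roles must alternate user/assistant, starting with user")
        role = "model" if message["role"] == "assistant" else "user"
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


rendered = render_gemma_chat([{"role": "user", "content": "Hello"}])
print(rendered)
```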

Generating a response

inputs = tokenizer.encode(chat_prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to("cuda"), max_new_tokens=8)
response = tokenizer.decode(outputs[0, len(inputs[0]):], skip_special_tokens=True)
print(response)

Multi-turn interaction

messages.extend([
    { "role": "assistant", "content": response },
    { "role": "user", "content": "Explain your reasoning based on the molecule structure." },
])

chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(chat_prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to("cuda"), max_new_tokens=512)
response = tokenizer.decode(outputs[0, len(inputs[0]):], skip_special_tokens=True)
print(response)
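
The multi-turn pattern above repeats the same three steps each turn: append the user message, generate, append the reply. A minimal sketch of that loop as a reusable function follows. `chat_turn` and `fake_generate` are hypothetical names; `fake_generate` stands in for the template, encode, generate, and decode steps so the example runs without a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(messages: List[Message], user_text: str,
              generate_fn: Callable[[List[Message]], str]) -> str:
    """Append a user turn, call the model, and record its reply in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = generate_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


# Stand-in for the tokenizer + model.generate steps shown above.
def fake_generate(history: List[Message]) -> str:
    return f"reply to turn {len(history)}"


history: List[Message] = []
chat_turn(history, "First question", fake_generate)
chat_turn(history, "Explain your reasoning based on the molecule structure.", fake_generate)
print([m["role"] for m in history])
```

Keeping generation behind a callable makes it easy to swap between the `generate`-based code and the pipeline-based code without changing how the history is managed.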

Multi-turn interaction with the pipeline API

# pip install transformers
from transformers import pipeline

# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-9b-chat",
    device="cuda",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Format prompt in the conversational format for initial turn
messages = [
    { "role": "user", "content": prompt}
]

# Generate response for initial turn
outputs = pipe(messages, max_new_tokens=8)
print(outputs[0]["generated_text"][-1]["content"].strip())

# Append user prompt for an additional turn
messages = outputs[0]["generated_text"]
messages.append(
    { "role": "user", "content": "Explain your reasoning based on the molecule structure." }
)

# Generate response for additional turn
outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1]["content"].strip())

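Example notebooks

Quick trial: a quick start notebook in Colab with some example evaluation tasks from TDC.
Fine-tuning demo: a Colab notebook showing how to fine-tune TxGemma in Hugging Face.
Agentic workflow demo: a Colab notebook that uses TxGemma as part of a larger agentic workflow powered by Gemini 2.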

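📚 Detailed documentation

Model architecture overview

Base architecture: TxGemma is based on the Gemma 2 family of lightweight, state-of-the-art open LLMs and uses a decoder-only Transformer architecture.
Base model: Gemma 2 (2B, 9B, and 27B parameter versions).
Fine-tuning data: Therapeutics Data Commons, a collection of instruction-tuning datasets covering diverse therapeutic modalities and targets.
Training approach: instruction fine-tuning on a mixture of therapeutic data (TxT); the conversational variants additionally use general instruction-tuning data.
Conversational variants: the TxGemma-Chat models (9B and 27B) are trained on a mixture of therapeutic and general instruction-tuning data to retain conversational ability.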

Technical specifications

Property	Details
Model type	Decoder-only Transformer (based on Gemma 2)
Key publication	TxGemma: Efficient and Agentic LLMs for Therapeutics
Model created	2025-03-18 (from the TxGemma Variant Proposal)
Model version	1.0.0

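Performance and validation

TxGemma's performance has been validated on a comprehensive benchmark of 66 therapeutic tasks derived from TDC.

Key performance metrics

Aggregated improvement: improves on the original Tx-LLM paper on 45 of the 66 therapeutic tasks.
Best-in-class performance: matches or exceeds best-in-class performance on 50 of the 66 tasks and exceeds specialist models on 26. See Table A.11 of the TxGemma paper for details.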
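
The task counts above are easier to compare as shares of the 66-task benchmark. The short calculation below derives those percentages from the reported counts only; the dictionary keys are descriptive labels chosen for this example.

```python
total_tasks = 66
reported_counts = {
    "improved over Tx-LLM": 45,
    "matches or exceeds best-in-class": 50,
    "exceeds specialist models": 26,
}

# Convert each count to a percentage of all benchmark tasks, rounded to one decimal.
shares = {name: round(100 * count / total_tasks, 1) for name, count in reported_counts.items()}
for name, share in shares.items():
    print(f"{name}: {share}% of tasks")
```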

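Inputs and outputs

Input: text. For best performance, format text prompts according to the TDC structure: instructions, context, question, and optionally a few few-shot examples. Inputs can include SMILES strings, amino acid sequences, nucleotide sequences, and natural-language text.
Output: text.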
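
The prompt structure described above (instructions, context, question, optional few-shot examples) can be sketched as a small builder. This is an illustration under assumptions: the function `build_tdc_style_prompt`, the section labels, and all text passed to it are placeholders, and the templates shipped in tdc_prompts.json remain the reference for exact wording and layout.

```python
from typing import Sequence, Tuple


def build_tdc_style_prompt(instructions: str, context: str, question: str,
                           query: str, few_shot: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble instructions, context, question, optional examples, then the query."""
    lines = [
        f"Instructions: {instructions}",
        f"Context: {context}",
        f"Question: {question}",
    ]
    # Each few-shot example is an (input, answer) pair placed before the real query.
    for example_input, example_answer in few_shot:
        lines.append(f"{example_input}\nAnswer: {example_answer}")
    lines.append(f"{query}\nAnswer:")
    return "\n".join(lines)


# Placeholder text throughout; not the real TDC template or real labels.
prompt = build_tdc_style_prompt(
    instructions="<task instructions>",
    context="<background on the property being predicted>",
    question="<question with answer options>",
    query="Drug SMILES: CCO",
    few_shot=[("Drug SMILES: <example SMILES>", "<example answer>")],
)
print(prompt)
```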

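Dataset details

Training datasets

Therapeutics Data Commons: a curated collection of instruction-tuning datasets covering 66 tasks related to the discovery and development of safe and effective medicines, with more than 15 million data points across different biomedical entities. The released TxGemma models are trained only on datasets with commercial licenses, whereas the models in our paper were also trained on datasets with non-commercial licenses.
General instruction-tuning data: used in combination with TDC for TxGemma-Chat.

Evaluation datasets

Therapeutics Data Commons: the same 66 tasks used for training are used for evaluation, following TDC's recommended data split methods (random, scaffold, cold-start, combination, and temporal).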
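
Scaffold and cold-start splits share one idea: items that belong to the same group (a molecular scaffold, a protein, a drug) must not appear on both sides of the split. The sketch below illustrates that idea with a generic group-wise split. It is not TDC's implementation; `group_split` is a hypothetical function, and the scaffold keys are assumed to be precomputed (for example by a cheminformatics toolkit) and are made up here.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_split(group_keys: Sequence[Hashable], train_fraction: float = 0.8
                ) -> Tuple[List[int], List[int]]:
    """Split item indices so that no group appears in both train and test."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)
    train_cap = train_fraction * len(group_keys)
    train: List[int] = []
    test: List[int] = []
    # Largest groups first, so rare groups tend to end up in the test set.
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Made-up scaffold keys for ten molecules.
scaffolds = ["s1", "s1", "s1", "s2", "s2", "s3", "s1", "s4", "s2", "s5"]
train_idx, test_idx = group_split(scaffolds, train_fraction=0.8)
print("train:", train_idx)
print("test:", test_idx)
```

A random split would scatter molecules with the same scaffold across both sets, which usually makes evaluation easier than the grouped variant.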

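Implementation information

Software

Training was done with JAX. JAX lets researchers take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.

Use and limitations

Intended use

Research and development of therapeutics.

Benefits

TxGemma is a versatile and powerful tool for accelerating therapeutic development. It offers:

Strong performance across a wide range of tasks.
Data efficiency compared with larger models.
A foundation for further fine-tuning on private data.
Integration into agentic workflows.

Limitations

Trained on public data from TDC.
Task-specific validation remains an important part of downstream model development by the end user.
As with any research, developers should validate every downstream application to understand its performance on data that appropriately represents the intended use setting for that application (e.g., age, sex, condition, scanner).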

📄 License

Use of TxGemma is governed by the Health AI Developer Foundations terms of use.

📖 Citation

@article{wang2025txgemma,
    title={TxGemma: Efficient and Agentic LLMs for Therapeutics},
    author={Wang, Eric and Schmidgall, Samuel and Jaeger, Paul F. and Zhang, Fan and Pilgrim, Rory and Matias, Yossi and Barral, Joelle and Fleet, David and Azizi, Shekoofeh},
    year={2025},
}

Paper: TxGemma: Efficient and Agentic LLMs for Therapeutics