license: gemma
library_name: transformers
pipeline_tag: text2text-generation
tags:
- conversational
- neo4j
- cypher
- text2cypher
base_model: google/gemma-2-9b-it
datasets:
- neo4j/text2cypher-2024v1
language:
- en
# Model Card

## Model Details

### Model Description

This model demonstrates how fine-tuning a foundation model on the Neo4j-Text2Cypher (2024) dataset (https://huggingface.co/datasets/neo4j/text2cypher-2024v1) can improve performance on the Text2Cypher task.
Note that this is part of ongoing research and exploration, intended to highlight the dataset's potential rather than to serve as a production-ready solution.

- Base model: google/gemma-2-9b-it
- Dataset: neo4j/text2cypher-2024v1

An overview of the fine-tuned models and the benchmarking results is shared at Link 1 and Link 2.

Have suggestions or insights? Contact us at: Neo4j/Team-GenAI
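For a quick look at the training data, the dataset can be loaded straight from the Hugging Face Hub. The snippet below is a minimal sketch; the split names and column layout are assumptions, so check the dataset card for the exact fields.

```python
from datasets import load_dataset

# Load the Neo4j-Text2Cypher (2024) dataset from the Hugging Face Hub.
dataset = load_dataset("neo4j/text2cypher-2024v1")

print(dataset)               # available splits and column names
print(dataset["train"][0])   # one raw example ("train" split name is an assumption)
```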
## Bias, Risks, and Limitations

Be aware of the following risks:

- In the current evaluation setup, the training and test sets come from the same data distribution (sampled from a larger dataset). If the data distribution changes, the results may not follow the same pattern.
- The datasets used were gathered from publicly available sources. Over time, the base models may gain access to the training and test sets and achieve similar or even better results.

For the related blog post, see: Link
## Training Details

### Training Procedure

Trained on RunPod with the following setup:

- 1 x A100 PCIe
- 31 vCPUs, 117 GB RAM
- runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image
- On-Demand Secure Cloud
- 60 GB disk
- 60 GB pod volume
### Training Hyperparameters

The following configurations were used (a code sketch reconstructing them follows this list):

- LoRA config:
  r=64,
  lora_alpha=64,
  target_modules=target_modules,
  lora_dropout=0.05,
  bias="none",
  task_type="CAUSAL_LM"
- SFT config:
  dataset_text_field=dataset_text_field,
  per_device_train_batch_size=4,
  gradient_accumulation_steps=8,
  dataset_num_proc=16,
  max_seq_length=1600,
  logging_dir="./logs",
  num_train_epochs=1,
  learning_rate=2e-5,
  save_steps=5,
  save_total_limit=1,
  logging_steps=5,
  output_dir="outputs",
  optim="paged_adamw_8bit",
  save_strategy="steps"
- Quantization config:
  load_in_4bit=True,
  bnb_4bit_use_double_quant=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16
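Expressed as code, these hyperparameters map onto the `peft`, `transformers` (bitsandbytes), and `trl` configuration objects roughly as follows. This is a minimal sketch rather than the exact training script: `target_modules` and `dataset_text_field` are left as placeholders because their values are not spelled out above.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# Placeholders: the card does not state which modules were adapted
# or which dataset column holds the training text.
target_modules = "all-linear"   # assumption; replace with the actual module list
dataset_text_field = "text"     # assumption; replace with the actual column name

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    dataset_num_proc=16,
    max_seq_length=1600,
    logging_dir="./logs",
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=5,
    save_total_limit=1,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",
)
```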
### Framework Versions

## Example Cypher Generation

The snippet below loads the fine-tuned model and generates a Cypher statement from a natural-language question and a graph schema:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "DavidLanz/text2cypher-gemma-2-9b-it-finetuned-2024v1"

# Load the fine-tuned model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "Which movies did Tom Hanks act in?"
schema = "(:Actor)-[:ActedIn]->(:Movie)"

# Prompt template used for Text2Cypher generation.
instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema.\n"
    "Schema: {schema} \n Question: {question} \n Cypher output: "
)

prompt = instruction.format(schema=schema, question=question)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Cypher query:", generated_text)
```
Alternatively, continuing from the snippet above, the tokenizer's chat template can be used to build the prompt, and the raw output can be post-processed to strip explanations and Markdown code fences:

```python
def prepare_chat_prompt(question, schema):
    # Wrap the instruction in a single-turn chat message.
    chat = [
        {
            "role": "user",
            "content": instruction.format(schema=schema, question=question),
        }
    ]
    return chat


def _postprocess_output_cypher(output_cypher: str) -> str:
    # Drop any trailing explanation and surrounding ```cypher fences from the output.
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    output_cypher = output_cypher.strip("`\n")
    output_cypher = output_cypher.lstrip("cypher\n")
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher


new_message = prepare_chat_prompt(question=question, schema=schema)

try:
    prompt = tokenizer.apply_chat_template(new_message, add_generation_prompt=True, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)

    chat_generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    final_cypher = _postprocess_output_cypher(chat_generated_text)
    print("Post-processed Cypher query:", final_cypher)
except AttributeError:
    print("Error: this tokenizer does not support `apply_chat_template`; please check compatibility.")
```