Text2Cypher-gemma-2-9b-it-finetuned-2024v1 open-source model - convert natural language to Cypher queries for free

Text2cypher Gemma 2 9b It Finetuned 2024v1

Developed by neo4j

This is a Text2Cypher model fine-tuned from google/gemma-2-9b-it. It converts natural-language questions into Cypher queries for the Neo4j graph database.

Knowledge graph

Safetensors

Language: English  License: Apache-2.0  #Natural language to Cypher  #Neo4j graph database interaction  #Efficient LoRA fine-tuning

Downloads: 2,093

Released: 9/10/2024

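Model overview

This model demonstrates how fine-tuning a base model on the Neo4j-Text2Cypher (2024) dataset can improve performance on the Text2Cypher task. Its main use is converting natural-language questions into Cypher queries.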

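Model features

Natural language to Cypher conversion

Converts natural-language questions into valid Cypher queries

LoRA fine-tuning

Adapted with LoRA, a parameter-efficient fine-tuning technique, to improve task-specific performance while preserving the base model's capabilities

4-bit quantization support

Supports 4-bit quantized inference, which lowers hardware requirements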
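
As a rough illustration of why 4-bit inference lowers hardware requirements, the sketch below estimates the memory taken by the weights alone at two precisions. The parameter count is an assumption (the "9b" in the model name read as roughly 9 billion), and the estimate ignores activations, the KV cache, quantization constants, and any layers kept in higher precision, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 9e9  # assumption: "9b" taken as roughly 9 billion parameters

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: ~{bf16_gib:.1f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")
```

Under these assumptions the weights shrink by a factor of four, which is what can bring a model of this size within reach of a single consumer GPU.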

Model capabilities

Natural language understanding

Cypher query generation

Graph database interaction

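Use cases

Graph database querying

Actor and movie lookup

Find all movies a given actor appeared in

Generates the query MATCH (a:Actor)-[:ActedIn]->(m:Movie) RETURN m

Complex relationship queries

Find relationship paths that satisfy specific conditions

Generates multi-hop queries based on the schema

Data analysis

Graph statistics

Generate queries that summarize properties of the graph data

Generates queries that use aggregation functions such as COUNT and SUM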
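
To make the three query shapes above concrete, here are hand-written illustrations of each. These are not outputs of the model. The `Director` label, the `Directed` relationship, and the `name` property are hypothetical additions to the Actor/Movie pattern used on this page.

```python
# Hand-written examples of the query shapes described above (NOT model output).
# Director, Directed and the name property are hypothetical schema elements.
example_queries = {
    "single hop": (
        "MATCH (a:Actor {name: 'Tom Hanks'})-[:ActedIn]->(m:Movie) RETURN m"
    ),
    "multi-hop": (
        "MATCH (a:Actor)-[:ActedIn]->(m:Movie)<-[:Directed]-(d:Director) "
        "RETURN DISTINCT d"
    ),
    "aggregation": (
        "MATCH (a:Actor)-[:ActedIn]->(m:Movie) "
        "RETURN a.name AS actor, count(m) AS movies ORDER BY movies DESC"
    ),
}

for kind, query in example_queries.items():
    print(f"{kind}: {query}")
```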

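🚀 Text-to-Cypher generation model

This model demonstrates how fine-tuning a base model on the Neo4j-Text2Cypher (2024) dataset can improve performance on the text-to-Cypher task. It is part of ongoing research and exploration, intended to highlight the dataset's potential rather than to serve as a production-ready solution.

🚀 Quick start

Use the code below to get started with the model: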

from peft import PeftModel, PeftConfig
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)


instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question}  \n Cypher output: "
)


def prepare_chat_prompt(question, schema) -> list[dict]:
    chat = [
        {
            "role": "user",
            "content": instruction.format(
                schema=schema, question=question
            ),
        }
    ]
    return chat

def _postprocess_output_cypher(output_cypher: str) -> str:
    # Remove any explanation. E.g.  MATCH...\n\n**Explanation:**\n\n -> MATCH...
    # Remove cypher indicator. E.g.```cypher\nMATCH...```` --> MATCH...
    # Note: Possible to have both:
    #   E.g. ```cypher\nMATCH...````\n\n**Explanation:**\n\n --> MATCH...
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    output_cypher = output_cypher.strip("`\n")
    # removeprefix (Python 3.9+) drops only the literal "cypher" tag.
    # lstrip("cypher\n") treats its argument as a character set and could
    # also eat the first letters of a lowercase query such as "call ...".
    output_cypher = output_cypher.removeprefix("cypher").lstrip("\n")
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher

# Model
model_name = "neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
)

# Question
question = "What are the movies of Tom Hanks?"
schema = "(:Actor)-[:ActedIn]->(:Movie)" # Check the NOTE below on creating your own schemas
new_message = prepare_chat_prompt(question=question, schema=schema)
prompt = tokenizer.apply_chat_template(new_message, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt", padding=True)

# Any other parameters
model_generate_parameters = {
    "top_p": 0.9,
    "temperature": 0.2,
    "max_new_tokens": 512,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

inputs.to(model.device)
model.eval()
with torch.no_grad():
    tokens = model.generate(**inputs, **model_generate_parameters)
    tokens = tokens[:, inputs.input_ids.shape[1] :]
    raw_outputs = tokenizer.batch_decode(tokens, skip_special_tokens=True)
    outputs = [_postprocess_output_cypher(output) for output in raw_outputs]
    
print(outputs)
# Example output: ["MATCH (a:Actor {Name: 'Tom Hanks'})-[:ActedIn]->(m:Movie) RETURN m"]
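
To show what the post-processing step does without loading the model, the sketch below applies a standalone variant of the cleanup to a few hand-written strings. `clean_cypher` is a hypothetical name, and the sample strings are illustrative assumptions about how a response might be wrapped, not captured model output.

```python
def clean_cypher(raw: str) -> str:
    """Strip an optional ```cypher fence and a trailing explanation."""
    # Keep only the text before an "**Explanation:**" marker, if one is present.
    query, _, _ = raw.partition("**Explanation:**")
    # Drop backticks, newlines and spaces left around the query by a Markdown fence.
    query = query.strip("`\n ")
    # Remove the literal language tag (str.removeprefix needs Python 3.9+).
    query = query.removeprefix("cypher")
    return query.strip("`\n ")


samples = [
    "MATCH (m:Movie) RETURN m",
    "```cypher\nMATCH (m:Movie) RETURN m\n```",
    "```cypher\nMATCH (m:Movie) RETURN m\n```\n\n**Explanation:**\n\nReturns every movie.",
]
cleaned = [clean_cypher(sample) for sample in samples]
print(cleaned)
```

All three samples reduce to the same bare query, so downstream code can pass the result to a database driver without handling Markdown wrappers.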





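📚 Detailed documentation

Model details

This model demonstrates how fine-tuning a base model on the Neo4j-Text2Cypher (2024) dataset can improve performance on the text-to-Cypher task. Note that it is part of ongoing research and exploration, intended to highlight the dataset's potential rather than to serve as a production-ready solution.

Base model: google/gemma-2-9b-it
Dataset: neo4j/text2cypher-2024v1

An overview of the fine-tuned models and the benchmarking results is available at Link1 and Link2.

If you have suggestions or insights, please contact us: Neo4j/Team-GenAI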

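Bias, risks, and limitations

Keep the following risks in mind:

- In our evaluation setup, the training and test sets come from the same data distribution (both are sampled from a larger dataset). If the data distribution changes, the results may not follow the same pattern.
- The datasets used were gathered from publicly available sources. Over time, foundation models may gain access to both the training and test sets and could then achieve similar or even better results.

See also the related blog post: Link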

Training details

Training procedure

Training ran on RunPod with the following setup:

- 1 x A100 PCIe
- 31 vCPU, 117 GB RAM
- runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
- On-Demand - Secure Cloud
- 60 GB Disk
- 60 GB Pod Volume

Training hyperparameters

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    dataset_num_proc=16,
    max_seq_length=1600,
    logging_dir="./logs",
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=5,
    save_total_limit=1,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
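
A few quantities follow directly from these hyperparameters, and the sketch below works them out. The batch size, accumulation steps, rank, and alpha are copied from the configuration above. The single device comes from the listed training setup. The 4096 x 4096 layer shape is a hypothetical size chosen only for illustration and is not taken from the Gemma 2 architecture.

```python
# Values copied from the training configuration on this page.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_devices = 1  # the training setup lists a single A100
lora_r = 64
lora_alpha = 64

# Number of sequences that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r


def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


# Hypothetical layer shape, for illustration only.
d_in = d_out = 4096
added = lora_extra_params(d_in, d_out, lora_r)
full = d_in * d_out

print(f"effective batch size: {effective_batch_size}")
print(f"LoRA scaling factor:  {lora_scaling}")
print(f"adapter params for one layer: {added} ({added / full:.1%} of the full matrix)")
```

With `lora_alpha` equal to `r`, the adapter update is applied at a scale of 1.0, and each optimizer step sees 32 sequences.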

Framework versions

PEFT 0.12.0

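NOTE on creating your own schemas

In the dataset we used, the schemas are already provided. They were created in one of two ways:
- directly using the schema provided by the input data source;
- creating the schema with the neo4j-graphrag package (see the SchemaReader.get_schema(...) function).
For your own Neo4j database, you can use the SchemaReader functions from the neo4j-graphrag package.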
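
If you only need a small schema string in the style of the quick start example, it can also be assembled by hand. The sketch below is a minimal illustration that does not use neo4j-graphrag. `format_schema` is a hypothetical helper, the comma separator is an assumption, and the `Director`/`Directed` entries are made up. Schemas in the dataset may carry more detail, such as node and relationship properties.

```python
def format_schema(relationships: list[tuple[str, str, str]]) -> str:
    """Render (start_label, rel_type, end_label) triples as Cypher-style patterns."""
    patterns = [
        f"(:{start})-[:{rel}]->(:{end})" for start, rel, end in relationships
    ]
    # Assumption: patterns are joined with a comma and a space.
    return ", ".join(patterns)


schema = format_schema(
    [
        ("Actor", "ActedIn", "Movie"),
        ("Director", "Directed", "Movie"),  # hypothetical relationship
    ]
)
print(schema)
```

The resulting string can be passed as the `schema` argument when building the prompt.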