T5-LM-Large-text2sql-spider open-source model - convert text into executable SQL queries for free

T5 LM Large Text2sql Spider

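Developed by gaussalgo

A text-to-SQL model fine-tuned from T5-large-LM-adapt that generates executable SQL queries by incorporating the database table schema into its input

Large Language Model

Transformers

English #Structured query generation #Database-aware #Natural language to SQL

Downloads: 2,124

Released: 4/25/2023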

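Model Overview

Given a natural-language question and the schema of a database, the model generates a structured SQL query. It is intended for database querying scenarios.

Model Features

Database schema integration

During training, the database table schema is included in the input question, explicitly stating which columns and relationships are available

Cross-database generalization

Handles database schemas that did not appear in the training data

Executable SQL generation

The generated SQL is intended to run directly against the target database, avoiding problems such as references to unknown columns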

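Model Capabilities

Natural language to SQL conversion

Database query generation

Structured data access

Use Cases

Database querying

Musician information lookup

Query the average, minimum, and maximum age of musicians by nationality

Generated SQL: SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'

Report generation

Statistical report generation

Generate SQL queries for statistical reports from natural-language descriptions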

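🚀 T5 Large LM-Adapt for Text-to-SQL

This model generates structured SQL queries from natural-language prompts. It learns to map natural-language questions to SQL queries, and during training the database schema is included in the input question so that the model takes the structure of the specific database into account and generates queries that apply to it.

🚀 Quick Start

The model is used for the text-to-SQL task: given a natural-language question, it generates the corresponding SQL query. Because the schema is part of the training input, the model learns the mapping from schema to expected output, which helps it generalize to schemas that did not appear in the training data.

✨ Key Features

Schema-aware input: the database schema is included in the input question during training, so the model can take the structure of the specific database into account and generate queries that apply to it.
Better generalization: by learning the mapping from schema to expected output, the model generalizes better to schemas that did not appear in the training data.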

💻 Usage Example

Basic Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = 'gaussalgo/T5-LM-Large-text2sql-spider'
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

question = "What is the average, minimum, and maximum age for all French musicians?"
schema = """
   "stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int , "Average" int , foreign_key:  primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text , "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool , foreign_key:  primary key: "Singer_ID" [SEP] "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium" "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert"  foreign_key: "concert_ID" int from "concert" "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"
"""

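# Build the prompt: the question followed by the serialized database schema
input_text = " ".join(["Question: ",question, "Schema:", schema])

# Tokenize the prompt and generate the query (max_length caps the generated sequence)
model_inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**model_inputs, max_length=512)

# batch_decode returns a list with one decoded string per generated sequence
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("SQL Query:")
print(output_text[0])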

Output:

SQL Query:
SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'
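
Since the model is meant to produce executable SQL, one quick sanity check is to run the generated query against a small database. The following is a minimal sketch, assuming an in-memory SQLite database with a hand-made `singer` table; the rows are illustrative and are not taken from Spider.

```python
import sqlite3

# Query produced by the model in the example above.
generated_sql = "SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'"

# Hypothetical toy data: only a subset of the "singer" columns is created.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE singer ("Singer_ID" INTEGER PRIMARY KEY, "Name" TEXT, '
    '"Country" TEXT, "Age" INTEGER)'
)
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?, ?)",
    [(1, "A", "France", 30), (2, "B", "France", 40), (3, "C", "Japan", 25)],
)

# A query that references an unknown table or column raises sqlite3.Error.
try:
    result = conn.execute(generated_sql).fetchone()
except sqlite3.Error as exc:
    result = None
    print(f"Query failed: {exc}")

print(result)
```

SQLite matches column names case-insensitively, so the lowercase `age` and `country` in the generated query resolve to the `"Age"` and `"Country"` columns.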

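📚 Detailed Documentation

Dataset

The model was fine-tuned on the training splits of the Spider and Spider-Syn datasets. In addition to the question itself, the input contains the database schema, so that the model can generate a query for the given database.

Example input prompt: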

Question:  What is the average, minimum, and maximum age for all French musicians?
Schema: "stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int ,
        "Average" int , foreign_key:  primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text ,
        "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool ,
        foreign_key:  primary key: "Singer_ID" [SEP]
        "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium"
        "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert"
        foreign_key: "concert_ID" int from "concert"
        "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"

Example expected output:

SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'

Database Schema Format

The model was trained with the following standardized database schema format:

table_name column1_name column1_type column2_name column2_type ... foreign_key: FK_name FK_type from table_name column_name primary key: column_name [SEP]
table_name2 ...
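
The format above can be produced from a structured description of the tables. The sketch below is one possible serializer, not the authors' code: the function names `serialize_table` and `serialize_schema` and the tuple layouts are hypothetical, and the exact spacing is inferred from the example prompt on this page.

```python
from typing import List, Tuple

Column = Tuple[str, str]                # (column name, column type)
ForeignKey = Tuple[str, str, str, str]  # (column, type, referenced table, referenced column)


def serialize_table(name: str, columns: List[Column],
                    foreign_keys: List[ForeignKey], primary_key: List[str]) -> str:
    """Render one table as: "table" columns foreign_key: ... primary key: ..."""
    cols = " ".join(f'"{col}" {typ} ,' for col, typ in columns)
    fks = " ".join(
        f'"{col}" {typ} from "{ref_table}" "{ref_col}" ,'
        for col, typ, ref_table, ref_col in foreign_keys
    )
    pk = " ".join(f'"{col}"' for col in primary_key)
    # Empty parts leave a double space, as in the example prompt.
    return " ".join([f'"{name}"', cols, "foreign_key:", fks, "primary key:", pk])


def serialize_schema(tables: List[dict]) -> str:
    """Join the serialized tables with the [SEP] marker."""
    return " [SEP] ".join(serialize_table(**table) for table in tables)


# Illustrative subset of the concert database used in the example prompt.
tables = [
    {"name": "singer",
     "columns": [("Singer_ID", "int"), ("Name", "text"), ("Country", "text"), ("Age", "int")],
     "foreign_keys": [],
     "primary_key": ["Singer_ID"]},
    {"name": "singer_in_concert",
     "columns": [],
     "foreign_keys": [("concert_ID", "int", "concert", "concert_ID"),
                      ("Singer_ID", "text", "singer", "Singer_ID")],
     "primary_key": ["concert_ID", "Singer_ID"]},
]

schema = serialize_schema(tables)
question = "What is the average, minimum, and maximum age for all French musicians?"
input_text = " ".join(["Question: ", question, "Schema:", schema])
print(input_text)
```

Foreign-key columns appear only in the `foreign_key:` part, which is why `singer_in_concert` has an empty column list here, mirroring the example prompt.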

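Evaluation

Evaluation was run on the dev splits of the Spider and Spider-Syn datasets. The databases in the dev splits do not overlap with those in the training splits, so the model never saw the evaluation databases during training. A prediction is scored by running both the generated query and the reference query against the database and comparing the results. The Spider and Spider-Syn dev splits each contain 1,032 samples.

Spider dev accuracy: 49.2%
Spider-Syn dev accuracy: 39.5%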
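
The result-comparison metric described above can be sketched as follows. This is an illustration, not the authors' evaluation script: the helper names are hypothetical, and two choices are assumptions (rows are compared as an unordered multiset, and a query that fails to execute counts as wrong). For brevity a single toy SQLite database is used, whereas Spider pairs each question with its own database.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, reference_sql):
    """True when both queries run and return the same rows (order ignored)."""
    predicted = run_query(conn, predicted_sql)
    reference = run_query(conn, reference_sql)
    return predicted is not None and predicted == reference


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, reference) pairs whose results match."""
    pairs = list(pairs)
    return sum(execution_match(conn, p, r) for p, r in pairs) / len(pairs)


# Hypothetical toy database and query pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (Singer_ID INTEGER PRIMARY KEY, Name TEXT, Country TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?, ?)",
    [(1, "A", "France", 30), (2, "B", "France", 40), (3, "C", "Japan", 25)],
)

pairs = [
    ("SELECT count(*) FROM singer", "SELECT count(Singer_ID) FROM singer"),
    ("SELECT name FROM singer WHERE country = 'France'",
     "SELECT Name FROM singer WHERE Country = 'France' ORDER BY Age"),
    ("SELECT nationality FROM singer", "SELECT Country FROM singer"),  # unknown column
    ("SELECT max(age) FROM singer", "SELECT min(age) FROM singer"),    # different result
]

accuracy = execution_accuracy(conn, pairs)
print(f"Execution accuracy: {accuracy:.1%}")
```

Comparing results rather than query strings gives credit to predictions that are written differently from the reference but return the same rows.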

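Training

The model was trained with the Adaptor library, version 0.2.1, on the training splits of the Spider and Spider-Syn datasets, with the following parameters: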

from adaptor.utils import AdaptationArguments, StoppingStrategy

training_arguments = AdaptationArguments(output_dir="train_dir",
                                         learning_rate=5e-5,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         stopping_patience=8,
                                         save_total_limit=8,
                                         do_train=True,
                                         do_eval=True,
                                         bf16=True,
                                         warmup_steps=1000,
                                         gradient_accumulation_steps=8,
                                         logging_steps=10,
                                         eval_steps=200,
                                         save_steps=1000,
                                         num_train_epochs=10,
                                         evaluation_strategy="steps")

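The training run is relatively easy to reproduce, but we do not wish to publish the modified copy of the Spider dataset that it depends on. If you would like to look into it further, please contact us by opening a new PR or by email at stefanik(at)gaussalgo.com.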

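🔧 Technical Details

The model was fine-tuned from the t5-large-LM-adapt checkpoint. A text-to-SQL model generates a SQL query from a natural-language question, but when it has no knowledge of the target database, the generated query may reference unknown columns or otherwise ignore the database's schema. Our approach includes the database schema in the input question during training, so the model learns the mapping from schema to expected output and generalizes better to schemas that did not appear in the training data.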