pip-sql-1.3b open-source SQL generation model - free to use, outperforms most SQL expert models and ChatGPT


Pip Sql 1.3b

Developed by PipableAI

A 1.3-billion-parameter SQL generation model that outperforms most SQL expert models and ChatGPT on several popular benchmarks.

Large language model, multilingual
License: Apache-2.0
#text-to-SQL generation #efficient SQL queries #database interaction

Downloads: 1,288

Released: 2/14/2024

Model overview

A text-to-SQL model distilled from a DeepSeek base model. Given a natural-language question and a database schema, it generates a SQL query.

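Model features

High-performance SQL generation

Reported to outperform most comparable models and ChatGPT on the Spider, SParC, and CoSQL benchmarks

Efficient parameter count

Reaches these results with only 1.3 billion parameters, so it is cheaper to run than larger models

Multi-framework support

Works with both PyTorch and JAX/Flax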

Model capabilities

Natural language to SQL

Database query generation

Complex SQL statement construction

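Use cases

Database management

Business data analysis

Non-technical users can query a database in natural language

SQL queries are generated automatically from their questions

Database application development

Database query code is generated automatically during rapid prototyping

Less time is spent writing SQL by hand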

🚀 pipSQL-1.3b

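pipSQL-1.3b is a 1.3-billion-parameter SQL model that outperforms most SQL expert models and ChatGPT on popular benchmarks. It is built on a DeepSeek base model and targets text-to-SQL conversion.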

🚀 Quick start

You can try the project through the following links:

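✨ Key features

A 1.3-billion-parameter SQL model that outperforms most SQL expert models and ChatGPT on popular benchmarks.
A distilled model built on a DeepSeek base model.
For our more advanced model, see PipableAI/pip-library-etl-1.3b.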

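🔧 Technical details

How the model was built

We used softmax cross entropy and a modified form of policy gradient together with a Q loss, optimized in an EM setup. The loss behavior in this setup is shown in the figure:

[Figure: loss behavior in the training setup described above]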
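
Of the three loss terms named above, only softmax cross entropy has a standard textbook form, so that is the only one sketched here. This is a minimal illustration in plain Python, not the authors' training code: the function names and the toy logits are assumptions, and the modified policy gradient, the Q loss, and the EM schedule are not described in enough detail in the source to reproduce.

```python
import math


def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def sequence_loss(step_logits, target_ids):
    """Mean per-token cross entropy over one target sequence."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


# Hypothetical logits over a 3-token vocabulary for a 2-token target sequence.
print(sequence_loss([[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]], [0, 2]))
```

In a text-to-SQL setting the targets would be the token ids of the reference SQL, and the logits would come from the model at each decoding step.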

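Benchmarking

For benchmarking we used "Semantic Evaluation for Text-to-SQL with Distilled Test Suites", proposed by researchers at Yale and Berkeley. It is the officially accepted evaluation framework for Spider, SParC, and CoSQL. The benchmark contains 2200 test data points.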

Test Suite SQL Eval

Model	Easy	Medium	Hard	Extra
sqlcoder-7b-2	72.0	58.0	40.6	37.3
pipSQL-1.3b	78.5	57.5	42.1	28.3
pipSQL-7b	63.0	40.0	30.2	25.0
sqlcoder-7b	60.6	48.2	28.3	20.4
gpt-3.5	58.8	44.7	31.0	28.4
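
The idea behind test-suite evaluation can be sketched roughly as follows: a predicted query counts as correct only if it returns the same rows as the reference query on every database in the suite, which catches queries that happen to agree on a single database. This is a simplified illustration using Python's built-in sqlite3 module, not the official evaluation script; the helper names, the two toy databases, and the order-insensitive comparison are all assumptions.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Execute sql and return its rows as an order-insensitive multiset."""
    return Counter(conn.execute(sql).fetchall())


def matches_on_suite(predicted_sql, gold_sql, databases):
    """True only if both queries return the same rows on every database."""
    for conn in databases:
        try:
            if run(conn, predicted_sql) != run(conn, gold_sql):
                return False
        except sqlite3.Error:
            return False  # a query that fails to execute counts as wrong
    return True


def make_db(prices):
    """Build a hypothetical in-memory Products table with the given prices."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Products (product_id number, product_price number)")
    conn.executemany("INSERT INTO Products VALUES (?, ?)", enumerate(prices))
    return conn


# Two hypothetical database instances with the same schema but different contents.
suite = [make_db([5, 10, 30]), make_db([10, 20, 30])]
gold = ("SELECT product_price FROM Products "
        "WHERE product_price > (SELECT avg(product_price) FROM Products)")
wrong = "SELECT product_price FROM Products WHERE product_price > 12"

print(matches_on_suite(gold, gold, suite))   # True
print(matches_on_suite(wrong, gold, suite))  # False
```

The hard-coded threshold agrees with the reference query on the first database and only diverges on the second, which is why a suite of databases is stricter than a single one.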

We also benchmarked the model on the Defog eval, which contains 200 test data points handpicked by the Defog team.

Defog SQL-Eval

Team

Avi Kothari, Pratham Gupta, Ritvik Aryan Kalra, Rohan Bhatial, Soham Acharya

📦 Installation

pip install transformers

💻 Usage examples

Basic usage

prompt = f"""<schema>{schema}</schema>
<question>{question}</question>
<sql>"""
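
The template above can be wrapped in two small helpers, one to build the prompt and one to pull the query back out of the decoded text. This is a sketch: `build_prompt`, `extract_sql`, the toy schema, and the stand-in output string are illustrative names and data, not part of the model's API.

```python
def build_prompt(schema: str, question: str) -> str:
    """Fill the <schema>/<question>/<sql> template shown above."""
    return f"<schema>{schema}</schema>\n<question>{question}</question>\n<sql>"


def extract_sql(generated: str) -> str:
    """Return the text between <sql> and </sql>; tolerate a missing closing tag."""
    _, found, tail = generated.partition("<sql>")
    if not found:
        raise ValueError("no <sql> tag in generated text")
    return tail.split("</sql>")[0].strip()


schema = "CREATE TABLE Products (product_id number, product_price number);"
question = "What is the average product price?"
prompt = build_prompt(schema, question)

# Stand-in for a decoded model output (prompt plus completion); not a real model response.
fake_output = prompt + "SELECT avg(product_price) FROM Products</sql>"
print(extract_sql(fake_output))  # SELECT avg(product_price) FROM Products
```

Unlike a bare `split('<sql>')[1]`, the helper raises a descriptive error when the tag is absent and still returns the query when generation stops before `</sql>`.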

Advanced usage - PyTorch

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # use "cpu" if no GPU is available
model = AutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b").to(device)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

# `prompt` is the string built in the basic-usage template
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)
# Keep only the text between <sql> and </sql>
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])

Advanced usage - Flax

from transformers import FlaxAutoModelForCausalLM, AutoTokenizer

# from_pt=True converts the PyTorch checkpoint to Flax weights on load
model = FlaxAutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b", from_pt=True)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

# `prompt` is the string built in the basic-usage template
inputs = tokenizer(prompt, return_tensors="jax")
outputs = model.generate(**inputs, max_new_tokens=200)
# Flax generate() returns an output object; the token ids are in .sequences
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])

Example data and queries

Database schema

CREATE TABLE Products (
  product_id number,
  parent_product_id number,
  product_name text,
  product_price number,
  product_color text,
  product_size text,
  product_description text);

CREATE TABLE Customers (
  customer_id number,
  gender_code text,
  customer_first_name text,
  customer_middle_initial text,
  customer_last_name text,
  email_address text,
  login_name text,
  login_password text,
  phone_number text,
  address_line_1 text,
  town_city text,
  county text,
  country text);

CREATE TABLE Customer_Payment_Methods (
  customer_id number,
  payment_method_code text);

CREATE TABLE Invoices (
  invoice_number number,
  invoice_status_code text,
  invoice_date time);

CREATE TABLE Orders (
  order_id number,
  customer_id number,
  order_status_code text,
  date_order_placed time);

CREATE TABLE Order_Items (
  order_item_id number,
  product_id number,
  order_id number,
  order_item_status_code text);

CREATE TABLE Shipments (
  shipment_id number,
  order_id number,
  invoice_number number,
  shipment_tracking_number text,
  shipment_date time);

CREATE TABLE Shipment_Items (
  shipment_id number,
  order_item_id number);

Example queries

Question 1: What are the email address, town, and county of the customers who are of the least common gender?

SELECT email_address ,  town_city ,  county FROM customers GROUP BY gender_code ORDER BY count(*) ASC LIMIT 1

Question 2: What are the price and size of the products priced above the average?

SELECT product_price ,  product_size FROM products WHERE product_price  > (SELECT avg(product_price) FROM products)

Question 3: Which customers have not placed any orders? List their first name, middle initial, and last name.

SELECT T1.customer_first_name ,  T1.customer_middle_initial ,  T1.customer_last_name FROM Customers AS T1 WHERE T1.customer_id NOT IN (SELECT T2.customer_id FROM Orders AS T2)
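
One way to sanity-check a generated query is to run it against a small in-memory SQLite database that follows the schema. The sketch below does this for the Question 3 query; the Customers table is trimmed to the columns the query touches and the rows are hypothetical, so this only shows that the query executes and behaves as expected on toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (
  customer_id number,
  customer_first_name text,
  customer_middle_initial text,
  customer_last_name text);
CREATE TABLE Orders (
  order_id number,
  customer_id number);
-- Hypothetical rows, for illustration only.
INSERT INTO Customers VALUES (1, 'Ada', 'M', 'Stone'), (2, 'Ben', 'K', 'Reyes');
INSERT INTO Orders VALUES (10, 1);
""")

# The Question 3 query from above, unchanged.
generated_sql = (
    "SELECT T1.customer_first_name ,  T1.customer_middle_initial ,  T1.customer_last_name "
    "FROM Customers AS T1 WHERE T1.customer_id NOT IN (SELECT T2.customer_id FROM Orders AS T2)"
)
rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('Ben', 'K', 'Reyes')]: customer 2 is the only one with no order
```

SQLite accepts the `number` and `text` type names from the schema as written, which makes it convenient for quick checks like this before running a query on a production database.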

📄 License

The model is open-sourced under the Apache 2.0 license.

Property	Details
Model type	Distilled SQL model based on DeepSeek
Training data	PipableAI/pip-txt-to-sql-spider-bird-dataset
Evaluation metric	Accuracy
Tags	sql, code, text2sql, instruction_tuned, basemodel, jax, pytorch, text-generation-inference
Library	transformers
Task	Text generation