language: en
tags:
- multi-table-qa
- multi-table-question-answering
license: mit
pipeline_tag: table-question-answering
datasets:
- vaishali/spider-tableQA
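MultiTabQA (base-sized model)
MultiTabQA was proposed by Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke in the paper "MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering". The original repository can be found here.
Model description
MultiTabQA is a table question answering model that generates an answer table from multiple input tables. It can handle multi-table operators such as UNION, INTERSECT, EXCEPT, and JOINS.
MultiTabQA is based on the TAPEX (BART) architecture, which consists of a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.
Intended uses
You can use the raw model for SQL execution over multiple input tables. The model has been fine-tuned on the Spider dataset to answer natural language questions over multiple input tables.
How to use
Here is how to use this model with the transformers library: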
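To make the operators named above concrete, here is a minimal pandas sketch of what UNION, INTERSECT, EXCEPT, and JOIN compute over two tables. It illustrates the operators themselves, not the model's internals; the tables `a`, `b`, and `scores` are hypothetical toy data, not taken from Spider.

```python
import pandas as pd

# Hypothetical toy tables (duplicate-free), used only for illustration.
a = pd.DataFrame({"id": [1, 2, 3], "name": ["x", "y", "z"]})
b = pd.DataFrame({"id": [2, 3, 4], "name": ["y", "z", "w"]})

# UNION: rows from either table, with duplicates removed.
union = pd.concat([a, b]).drop_duplicates().reset_index(drop=True)

# INTERSECT: rows that appear in both tables.
intersect = a.merge(b, how="inner", on=["id", "name"])

# EXCEPT: rows of `a` that do not appear in `b`.
marked = a.merge(b, how="left", on=["id", "name"], indicator=True)
except_ = (
    marked[marked["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)

# JOIN: combine the columns of two tables on a shared key.
scores = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.9]})
joined = a.merge(scores, on="id", how="inner")

print(union, intersect, except_, joined, sep="\n\n")
```

Each result is itself a table, which is why a model answering such questions has to generate tabular output rather than a single cell value.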
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
tokenizer = AutoTokenizer.from_pretrained("vaishali/multitabqa-base")
model = AutoModelForSeq2SeqLM.from_pretrained("vaishali/multitabqa-base")
question = "How many departments are led by heads who are not mentioned?"
table_names = ['department', 'management']
tables=[{"columns":["Department_ID","Name","Creation","Ranking","Budget_in_Billions","Num_Employees"],
"index":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14],
"data":[
[1,"State","1789",1,9.96,30266.0],
[2,"Treasury","1789",2,11.1,115897.0],
[3,"Defense","1947",3,439.3,3000000.0],
[4,"Justice","1870",4,23.4,112557.0],
[5,"Interior","1849",5,10.7,71436.0],
[6,"Agriculture","1889",6,77.6,109832.0],
[7,"Commerce","1903",7,6.2,36000.0],
[8,"Labor","1913",8,59.7,17347.0],
[9,"Health and Human Services","1953",9,543.2,67000.0],
[10,"Housing and Urban Development","1965",10,46.2,10600.0],
[11,"Transportation","1966",11,58.0,58622.0],
[12,"Energy","1977",12,21.5,116100.0],
[13,"Education","1979",13,62.8,4487.0],
[14,"Veterans Affairs","1989",14,73.2,235000.0],
[15,"Homeland Security","2002",15,44.6,208000.0]
]
},
{"columns":["department_ID","head_ID","temporary_acting"],
"index":[0,1,2,3,4],
"data":[
[2,5,"Yes"],
[15,4,"Yes"],
[2,6,"Yes"],
[7,3,"No"],
[11,10,"No"]
]
}]
# Build DataFrames from the "split"-style dicts above (columns / index / data).
# pd.read_json expects a JSON string or file, so construct the frames directly.
input_tables = [pd.DataFrame(table["data"], index=table["index"], columns=table["columns"]) for table in tables]
# Flattened model input: the question followed by each table's name, header, and rows.
model_input_string = """How many departments are led by heads who are not mentioned? <table_name> : department col : Department_ID | Name | Creation | Ranking | Budget_in_Billions | Num_Employees row 1 : 1 | State | 1789 | 1 | 9.96 | 30266 row 2 : 2 | Treasury | 1789 | 2 | 11.1 | 115897 row 3 : 3 | Defense | 1947 | 3 | 439.3 | 3000000 row 4 : 4 | Justice | 1870 | 4 | 23.4 | 112557 row 5 : 5 | Interior | 1849 | 5 | 10.7 | 71436 row 6 : 6 | Agriculture | 1889 | 6 | 77.6 | 109832 row 7 : 7 | Commerce | 1903 | 7 | 6.2 | 36000 row 8 : 8 | Labor | 1913 | 8 | 59.7 | 17347 row 9 : 9 | Health and Human Services | 1953 | 9 | 543.2 | 67000 row 10 : 10 | Housing and Urban Development | 1965 | 10 | 46.2 | 10600 row 11 : 11 | Transportation | 1966 | 11 | 58.0 | 58622 row 12 : 12 | Energy | 1977 | 12 | 21.5 | 116100 row 13 : 13 | Education | 1979 | 13 | 62.8 | 4487 row 14 : 14 | Veterans Affairs | 1989 | 14 | 73.2 | 235000 row 15 : 15 | Homeland Security | 2002 | 15 | 44.6 | 208000 <table_name> : management col : department_ID | head_ID | temporary_acting row 1 : 2 | 5 | Yes row 2 : 15 | 4 | Yes row 3 : 2 | 6 | Yes row 4 : 7 | 3 | No row 5 : 11 | 10 | No"""
inputs = tokenizer(model_input_string, return_tensors="pt")
# Generate the answer table and decode the token ids back to text.
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
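The input string in the example is written out by hand. The sketch below shows one way such a string could be assembled from table names and DataFrames. The helper `linearize` is hypothetical and not part of the model's repository; its layout (`<table_name> : ... col : ... row i : ...`) is inferred only from the example string. Cell values are rendered with `str()`, so floats such as `30266.0` keep their decimal part, which differs from the hand-written string; number formatting may need adjusting to match what the model saw in training.

```python
import pandas as pd


def linearize(question, table_names, tables):
    """Flatten a question and several DataFrames into one input string.

    Hypothetical helper: the layout mirrors the example string in this card.
    """
    parts = [question]
    for name, df in zip(table_names, tables):
        parts.append(f"<table_name> : {name}")
        parts.append("col : " + " | ".join(str(c) for c in df.columns))
        # Rows are numbered from 1, as in the example string.
        for i, row in enumerate(df.itertuples(index=False), start=1):
            parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(parts)


# Small usage example with two rows of the `management` table.
management = pd.DataFrame(
    [[2, 5, "Yes"], [15, 4, "Yes"]],
    columns=["department_ID", "head_ID", "temporary_acting"],
)
example = linearize("Which heads are acting?", ["management"], [management])
print(example)
```

The resulting string can be passed to the tokenizer in place of a hand-written `model_input_string`.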
How to fine-tune
Please find the fine-tuning script here.
BibTeX entry and citation info
@inproceedings{pal-etal-2023-multitabqa,
title = "{M}ulti{T}ab{QA}: Generating Tabular Answers for Multi-Table Question Answering",
author = "Pal, Vaishali and
Yates, Andrew and
Kanoulas, Evangelos and
de Rijke, Maarten",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.348",
doi = "10.18653/v1/2023.acl-long.348",
pages = "6322--6334",
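abstract = "Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising of 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.",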
}