Language: English
Tags:
- tapas
License: apache-2.0
Datasets:
- msr_sqa
TAPAS mini model fine-tuned on Sequential Question Answering (SQA)
This model has two versions which can be used. The default version corresponds to the tapas_sqa_inter_masklm_mini_reset checkpoint of the original GitHub repository. The model was pre-trained on MLM (masked language modeling) plus an additional step the authors call intermediate pre-training, and then fine-tuned on the SQA dataset. It uses relative position embeddings (i.e. the position index is reset at every cell of the table).
The other (non-default) version is no_reset, which corresponds to tapas_sqa_inter_masklm_mini (intermediate pre-training, absolute position embeddings).
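For reference, here is a minimal sketch of loading either variant with the transformers library. It assumes this card is published under the Hugging Face model ID google/tapas-mini-finetuned-sqa and that the non-default variant lives on a branch named no_reset; both are assumptions, not confirmed by this card.

```python
from transformers import TapasForQuestionAnswering

# Default checkpoint (relative position embeddings, "reset").
# The model ID is an assumption based on this card's name.
model = TapasForQuestionAnswering.from_pretrained("google/tapas-mini-finetuned-sqa")

# Non-default variant (absolute position embeddings), assuming it is
# published on a branch called "no_reset".
model_no_reset = TapasForQuestionAnswering.from_pretrained(
    "google/tapas-mini-finetuned-sqa", revision="no_reset"
)
```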
Disclaimer: the team releasing TAPAS did not write a model card for this model, so this model card has been written by the Hugging Face team and contributors.
Results on the SQA development set (accuracy)
Model description
TAPAS is a BERT-like transformer model pre-trained in a self-supervised fashion on a large corpus of English tables from Wikipedia and their associated texts. This means it was pre-trained on the raw tables and associated texts only, with no human labelling (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pre-trained with the following objectives:
- Masked language modeling (MLM): given a (flattened) table and its context, the model randomly masks 15% of the words in the input, then runs the entire (partially masked) sequence through the model and predicts the masked words. This differs from traditional RNNs, which usually see the words one after the other, and from autoregressive models like GPT, which internally mask future tokens; it allows the model to learn a bidirectional representation of a table and its associated text.
- Intermediate pre-training: to encourage numerical reasoning over tables, the authors additionally pre-trained the model on a balanced dataset of millions of syntactically created training examples. Here, the model must predict whether a sentence is supported or refuted by the contents of a table; the training examples are created from synthetic as well as counterfactual statements.
This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head on top of the pre-trained model, and then jointly training this randomly initialized classification head with the base model on SQA.
Intended uses & limitations
You can use this model for answering questions about a table in a conversational set-up. For code examples, refer to the TAPAS documentation on the HuggingFace website; a rough usage sketch is also given below.
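As an illustration only (not an official example), a conversational query could look like the following sketch. It assumes the transformers and pandas libraries, the hypothetical model ID google/tapas-mini-finetuned-sqa, and an invented toy table.

```python
from transformers import pipeline
import pandas as pd

# Toy table for illustration; all cell values must be strings.
table = pd.DataFrame(
    {
        "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
        "Number of movies": ["87", "53", "69"],
    }
)

# Model ID is an assumption based on this card's name.
tqa = pipeline("table-question-answering", model="google/tapas-mini-finetuned-sqa")

# SQA is sequential: follow-up questions refer back to earlier ones,
# so the queries are passed as an ordered list with sequential=True.
answers = tqa(
    table=table,
    query=[
        "How many movies has George Clooney played in?",
        "And how many has Brad Pitt?",
    ],
    sequential=True,
)
for a in answers:
    print(a["answer"])
```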
Training procedure
Preprocessing
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are of the form:
[CLS] Question [SEP] Flattened table [SEP]
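A small sketch of this preprocessing with the transformers TapasTokenizer, assuming the model ID google/tapas-mini-finetuned-sqa and a toy table; the decoded output shown in the comment is approximate.

```python
from transformers import TapasTokenizer
import pandas as pd

# Model ID is an assumption; the preprocessing is the same for all TAPAS checkpoints.
tokenizer = TapasTokenizer.from_pretrained("google/tapas-mini-finetuned-sqa")

# Toy table; cell values must be strings.
table = pd.DataFrame({"City": ["Paris", "Berlin"], "Population": ["2M", "3.6M"]})

# The tokenizer lowercases the question, flattens the table row by row,
# and produces: [CLS] question [SEP] flattened table [SEP].
encoding = tokenizer(
    table=table,
    queries=["Which city has 2M inhabitants?"],
    return_tensors="pt",
)

print(tokenizer.decode(encoding["input_ids"][0]))
# roughly: "[CLS] which city has 2m inhabitants? [SEP] city population paris 2m berlin 3.6m [SEP]"
```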
Fine-tuning
The model was fine-tuned on 32 Cloud TPU v3 cores for 200,000 steps with a maximum sequence length of 512 and a batch size of 128. In this setup, fine-tuning takes around 20 hours. The optimizer used is Adam with a learning rate of 1.25e-5 and a warmup ratio of 0.2. An inductive bias is added such that the model only selects cells of the same column; this is reflected in the select_one_column parameter of TapasConfig. See also table 12 of the original paper. A configuration sketch is shown below.
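A minimal configuration sketch, assuming the transformers library and the google/tapas-mini base checkpoint; the model ID and the training-loop details are assumptions, and only the hyperparameter values come from the description above.

```python
import torch
from transformers import (
    TapasConfig,
    TapasForQuestionAnswering,
    get_linear_schedule_with_warmup,
)

# SQA-style configuration: a cell selection head without aggregation,
# plus the inductive bias of selecting cells from a single column.
config = TapasConfig.from_pretrained("google/tapas-mini")  # model ID is an assumption
config.num_aggregation_labels = 0   # SQA uses no aggregation operators
config.select_one_column = True     # restrict cell selection to one column

# Randomly initialized cell selection head on top of the pre-trained base model.
model = TapasForQuestionAnswering.from_pretrained("google/tapas-mini", config=config)

# Hyperparameters mirroring the description above:
# 200,000 steps, learning rate 1.25e-5, warmup ratio 0.2.
num_training_steps = 200_000
optimizer = torch.optim.Adam(model.parameters(), lr=1.25e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.2 * num_training_steps),
    num_training_steps=num_training_steps,
)
```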
BibTeX entry and citation info
@misc{herzig2020tapas,
title={TAPAS: Weakly Supervised Table Parsing via Pre-training},
author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
year={2020},
eprint={2004.02349},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
@misc{eisenschlos2020understanding,
title={Understanding tables with intermediate pre-training},
author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
year={2020},
eprint={2010.00571},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@InProceedings{iyyer2017search-based,
author = {Iyyer, Mohit and Yih, Scott Wen-tau and Chang, Ming-Wei},
title = {Search-based Neural Structured Learning for Sequential Question Answering},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
year = {2017},
month = {July},
abstract = {Recent work in semantic parsing for question answering has focused on long and complicated questions, many of which would seem unnatural if asked in a normal conversation between two humans. In an effort to explore a conversational QA setting, we present a more realistic task: answering sequences of simple but inter-related questions. We collect a dataset of 6,066 question sequences that inquire about semi-structured tables from Wikipedia, with 17,553 question-answer pairs in total. To solve this sequential question answering task, we propose a novel dynamic neural semantic parsing framework trained using a weakly supervised reward-guided search. Our model effectively leverages the sequential context to outperform state-of-the-art QA systems that are designed to answer highly complex questions.},
publisher = {Association for Computational Linguistics},
url = {https://www.microsoft.com/en-us/research/publication/search-based-neural-structured-learning-sequential-question-answering/},
}