DistilBERT open-source question answering model: fewer parameters, faster inference, free to deploy, accurate answers

Distilbert Base Cased Distilled Squad

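Developed by distilbert

DistilBERT is a lightweight distilled version of BERT with 40% fewer parameters and 60% faster inference, while retaining more than 95% of BERT's performance. This model is the question answering variant, fine-tuned on the SQuAD v1.1 dataset.

Question answering · English · License: Apache-2.0 · #QuestionAnswering #KnowledgeDistillation #EfficientInference

Downloads: 220.76k

Released: 3/2/2022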

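Model overview

A lightweight Transformer-based English question answering model for extractive question answering, that is, extracting the answer span from a given text.

Model features

Efficient and lightweight

Through knowledge distillation, the model is 40% smaller than the original BERT and runs 60% faster at inference.

High performance

Reaches an F1 score of 87.1 on the SQuAD v1.1 dev set, close to the 88.7 of the original BERT.

Focused on question answering

Optimized specifically for extractive question answering and usable directly when building question answering systems.

Model capabilities

Text understanding

Answer extraction

Context analysis

Use cases

Education technology

Automated question answering

Automatically extracts answers to questions from textbooks or reference materials.

Reaches an F1 score of 87.1 on the SQuAD benchmark.

Customer service

Automated FAQ responses

Quickly locates answers to questions in knowledge base documents.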

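🚀 DistilBERT base cased distilled SQuAD

DistilBERT base cased distilled SQuAD is a checkpoint fine-tuned from DistilBERT for question answering. It retains strong performance while having fewer parameters and running faster.

🚀 Quick start

Use the code below to get started with the model: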

>>> from transformers import pipeline
>>> question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

>>> context = r"""
... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
... a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
... """

>>> result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
>>> print(
... f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
... )

Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
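The pipeline call above answers a question against a single context. For FAQ-style lookup over several documents, one possible approach is to query each document and keep the highest-scoring answer. The sketch below assumes `qa` is any callable that accepts `question` and `context` keyword arguments and returns a dict with a `score` key, as the question-answering pipeline does; `best_answer` and `doc_id` are hypothetical names introduced for illustration, and using scores from separate calls as a ranking signal is an assumption rather than a documented method.

```python
from typing import Callable, Dict, Iterable, Optional


def best_answer(question: str, documents: Iterable[str],
                qa: Callable[..., Dict]) -> Optional[Dict]:
    """Run `qa` on every document and keep the highest-scoring answer.

    Returns None when `documents` is empty. The returned dict is the
    result from `qa` plus a hypothetical `doc_id` key (the document index).
    """
    best = None
    for doc_id, context in enumerate(documents):
        result = qa(question=question, context=context)
        if best is None or result["score"] > best["score"]:
            best = {**result, "doc_id": doc_id}
    return best


# Possible usage with the pipeline shown above (downloads the model):
# from transformers import pipeline
# qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
# print(best_answer("What is SQuAD?", ["first document ...", "second document ..."], qa))
```

Because the model always returns some span, a low best score may still indicate that none of the documents contains the answer.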

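The model can be used in PyTorch as follows:

from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
# Load the checkpoint together with its question answering head (start/end span logits).
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end token positions of the answer span.
answer_start_index = int(outputs.start_logits.argmax(dim=-1)[0])
answer_end_index = int(outputs.end_logits.argmax(dim=-1)[0])

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
print(tokenizer.decode(predict_answer_tokens))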

The model can be used in TensorFlow as follows:

from transformers import DistilBertTokenizer, TFDistilBertForQuestionAnswering
import tensorflow as tf

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="tf")
outputs = model(**inputs)

answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
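Taking the argmax of the start and end logits independently, as above, can produce an end index that precedes the start index. A minimal sketch of one common alternative, picking the highest-scoring span with start <= end, is shown here on plain Python lists. The function name `best_span` and the `max_answer_len` cap are assumptions made for illustration, not part of the model card.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) maximising start_logit + end_logit with start <= end."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy logits: independent argmax would give start=3, end=1 (an invalid span).
start = [0.1, 0.2, 0.0, 2.0]
end = [0.0, 3.0, 0.5, 1.0]
print(best_span(start, end))  # (1, 1)
```

This sketch does not mask out question or special tokens; a fuller version would restrict candidates to positions inside the context.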

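✨ Key features

DistilBERT model: DistilBERT was introduced in the blog post "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" and the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". It is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
Fine-tuned model: This model is a fine-tuned checkpoint of DistilBERT-base-cased, fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.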

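📚 Documentation

Uses

This model can be used for question answering.

Misuse and out-of-scope use

The model should not be used to intentionally create hostile or alienating environments for people. In addition, the model was not trained to be a factual or true representation of people or events, so using it to generate such content is out of scope for its abilities.

Risks, limitations and biases

⚠️ Important note

Readers should be aware that language generated by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model can include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups. For example: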

>>> from transformers import pipeline
>>> question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

>>> context = r"""
... Alice is sitting on the bench. Bob is sitting next to her.
... """

>>> result = question_answerer(question="Who is the CEO?", context=context)
>>> print(
... f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
...)

Answer: 'Bob', score: 0.7527, start: 32, end: 35

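Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Training

Training data

The distilbert-base-cased model was trained on the same data as the distilbert-base-uncased model. The distilbert-base-uncased model card describes its training data as follows:

DistilBERT was pretrained on the same data as BERT: BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).

To learn more about the SQuAD v1.1 dataset, see the SQuAD v1.1 data card.

Training procedure

Preprocessing

See the distilbert-base-cased model card for further details.

Pretraining

See the distilbert-base-cased model card for further details.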

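Evaluation

As discussed in the model repository:

This model reaches an F1 score of 87.1 on the SQuAD v1.1 dev set (for comparison, the BERT bert-base-cased version reaches an F1 score of 88.7).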
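For readers unfamiliar with the metric, the SQuAD F1 score measures token overlap between a predicted answer and a reference answer after light text normalization. The following is a simplified sketch of that per-answer computation, not the official evaluation script; the function names `normalize` and `token_f1` are chosen here for illustration.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall for one answer pair."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the SQuAD dataset", "SQuAD dataset"))  # 1.0
print(round(token_f1("SQuAD", "SQuAD dataset"), 4))    # 0.6667
```

A dataset-level score such as 87.1 is obtained by taking, for each question, the best F1 over its reference answers, averaging over all questions, and scaling to 100.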

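Environmental impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). We present the hardware type and hours used based on the associated paper. Note that these details cover only the training of DistilBERT, not the fine-tuning on SQuAD.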

Property	Details
Hardware type	8 × 16 GB V100 GPUs
Hours used	90 hours
Cloud provider	Unknown
Compute region	Unknown
Carbon emitted	Unknown
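The table gives enough to compute total GPU-hours exactly (8 GPUs for 90 hours). Converting that into energy or emissions requires figures the source does not report. The rough sketch below shows the arithmetic only: the per-GPU power draw and the grid carbon intensity are placeholder assumptions, not values from the paper or the model card, and the resulting emissions figure should not be read as a measurement.

```python
# Reported in the table above.
NUM_GPUS = 8
TRAINING_HOURS = 90

# ASSUMPTIONS (not from the source): placeholder values for illustration only.
ASSUMED_WATTS_PER_GPU = 300.0   # assumed average draw per GPU; real draw varies
ASSUMED_KG_CO2_PER_KWH = 0.4    # hypothetical grid intensity; the region is unknown


def gpu_hours(num_gpus: int, hours: float) -> float:
    """Total accelerator time."""
    return num_gpus * hours


def estimate_emissions_kg(num_gpus: int, hours: float,
                          watts_per_gpu: float, kg_co2_per_kwh: float) -> float:
    """Energy in kWh times carbon intensity.

    Ignores CPUs, cooling, and other datacenter overhead.
    """
    energy_kwh = gpu_hours(num_gpus, hours) * watts_per_gpu / 1000.0
    return energy_kwh * kg_co2_per_kwh


print(gpu_hours(NUM_GPUS, TRAINING_HOURS))  # 720 GPU-hours
# Illustrative only: the result depends entirely on the assumed constants.
print(estimate_emissions_kg(NUM_GPUS, TRAINING_HOURS,
                            ASSUMED_WATTS_PER_GPU, ASSUMED_KG_CO2_PER_KWH))
```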

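Technical specifications

See the associated paper for details on the model architecture, objective, compute infrastructure, and training details.

Citation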

@inproceedings{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  booktitle={NeurIPS EMC^2 Workshop},
  year={2019}
}

APA:

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.