Parrot open-source paraphrasing framework: high-quality T5-based paraphrases to speed up NLU model training

Parrot Paraphraser On T5

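Developed by prithivida

Parrot is a T5-based paraphrasing framework built to speed up the training of natural language understanding (NLU) models by generating high-quality paraphrases for data augmentation.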

Text generation

Transformers

#NLU training acceleration #Multi-parameter controllable paraphrasing #Conversational interface augmentation

Downloads: 910.07k

Released: 3/2/2022

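Model overview

Parrot is an utterance augmentation framework. It expands NLU training data by generating diverse paraphrases that preserve meaning, and it exposes parameters for tuning adequacy, fluency, and diversity.

Model features

Three-metric optimization

Optimizes paraphrases for adequacy (meaning preservation), fluency (grammatical correctness), and diversity (lexical and syntactic variation) at the same time.

Tunable parameters

The diversity ranker, the number of returned phrases, the length limit, and other parameters can be adjusted to fit different needs.

NLU-focused augmentation

Focuses on augmenting the text that users enter into conversational systems, producing short texts (maximum length 32) suited to NLU model training.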

Model capabilities

Paraphrase generation

Data augmentation for natural language understanding

Multilingual text rewriting

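Use cases

Conversational system development

Intent classification data expansion

Generates diverse training samples for intent classification tasks with limited labeled data

Improves model generalization and reduces overfitting

Slot-preserving augmentation

Generates paraphrase variants that keep key entity slots intact

Expands the dataset without breaking the annotation structure

Educational applications

Language learning material generation

Creates multiple ways of phrasing the same question

Helps learners master varied expressions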

🚀 Parrot

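Parrot is a paraphrase-based utterance augmentation framework built to speed up the training of natural language understanding (NLU) models. A paraphrasing framework is more than just a paraphrasing model. For more details on the library and how to use it, see the GitHub page.

🚀 Quick start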

from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

''' 
Uncomment to get reproducible paraphrase generations
def random_state(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

random_state(1234)
'''

# Initialize the model (if you integrate this into your own code, make sure to initialize it only once)
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)

phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
  print("-"*100)
  print("Input_phrase: ", phrase)
  print("-"*100)
  para_phrases = parrot.augment(input_phrase=phrase)
  for para_phrase in para_phrases:
    print(para_phrase)

Running the code above produces output like the following:

----------------------------------------------------------------------
Input_phrase: Can you recommed some upscale restaurants in Newyork?
----------------------------------------------------------------------
list some excellent restaurants to visit in new york city?
what upscale restaurants do you recommend in new york?
i want to try some upscale restaurants in new york?
recommend some upscale restaurants in newyork?
can you recommend some high end restaurants in newyork?
can you recommend some upscale restaurants in new york?
can you recommend some upscale restaurants in newyork?
----------------------------------------------------------------------
Input_phrase: What are the famous places we should not miss in Russia?
----------------------------------------------------------------------
what should we not miss when visiting russia?
recommend some of the best places to visit in russia?
list some of the best places to visit in russia?
can you list the top places to visit in russia?
show the places that we should not miss in russia?
list some famous places which we should not miss in russia?

📦 Installation

pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

💻 Usage examples

Basic usage

from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

# Initialize the model (if you integrate this into your own code, make sure to initialize it only once)
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)

phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
  print("-"*100)
  print("Input_phrase: ", phrase)
  print("-"*100)
  para_phrases = parrot.augment(input_phrase=phrase)
  for para_phrase in para_phrases:
    print(para_phrase)
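
The comment about initializing the model only once can be handled with a small cached factory. The following is a minimal sketch, not part of the Parrot library: `get_parrot` is a hypothetical helper name, and it only relies on the `Parrot(model_tag=..., use_gpu=...)` constructor shown in the example above.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False):
    """Hypothetical helper: build the Parrot instance once and reuse it."""
    # Imported lazily so that defining this helper does not load the model.
    from parrot import Parrot

    return Parrot(model_tag=model_tag, use_gpu=use_gpu)
```

Each call such as `get_parrot().augment(input_phrase=phrase)` then reuses the same instance. Note that `lru_cache` keys on the arguments as passed, so calling the helper with a different argument style or different values builds a new instance.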

Advanced usage

# Assumes `parrot` and `phrase` are defined as in the basic usage example.
para_phrases = parrot.augment(input_phrase=phrase,
                              diversity_ranker="levenshtein",
                              do_diverse=False,
                              max_return_phrases=10,
                              max_length=32,
                              adequacy_threshold=0.99,
                              fluency_threshold=0.90)

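✨ Key features

Filling the gaps left by existing paraphrasing tools

Huggingface lists 12 paraphrase models, RapidAPI lists 7 paid and commercial paraphrasers such as QuillBot, Rasa has discussed an experimental paraphraser for augmenting text data, Sentence-Transformers offers a paraphrase mining utility, and NLPAug offers word-level augmentation with PPDB (a database of millions of paraphrases). These paraphrasing efforts are all good, but some gaps remain, and paraphrasing is not yet a mainstream choice for text augmentation when building NLU models. Parrot aims to fill those gaps.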

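Controllable paraphrase quality

A good paraphrase has to satisfy three key metrics:

Adequacy (is the meaning adequately preserved?)
Fluency (is the paraphrase fluent English?)
Diversity (lexical / phrasal / syntactic) (how much has the paraphrase changed the original sentence?)

Parrot offers parameters to control adequacy, fluency, and diversity according to your needs.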
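
To make the interplay of the three metrics concrete, here is a minimal sketch of thresholding on adequacy and fluency and then ranking the survivors by edit distance from the source. It does not reproduce Parrot's internals: the function names are hypothetical, the adequacy and fluency scores are assumed to be supplied by the caller, and plain Levenshtein distance stands in for a diversity ranker.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def filter_and_rank(source, candidates, adequacy_threshold, fluency_threshold):
    """Hypothetical helper; candidates is a list of (text, adequacy, fluency)."""
    # Drop candidates that fall below either quality threshold.
    kept = [
        text
        for text, adequacy, fluency in candidates
        if adequacy >= adequacy_threshold and fluency >= fluency_threshold
    ]
    # Put the paraphrase that differs most from the source first.
    return sorted(
        kept,
        key=lambda text: levenshtein(source.lower(), text.lower()),
        reverse=True,
    )
```

Raising the two thresholds trades quantity for faithfulness and fluency, while the ranking step decides which of the remaining paraphrases are surfaced first.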

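What makes a good augmenter

To train an NLU model we need not just a lot of utterances, but utterances annotated with intents and slots/entities. A good augmenter should be able to do the following:

1. Given an input utterance and its input annotations, output N paraphrased utterances that preserve the intent and slots.
2. Convert the output paraphrases into annotated data using the input annotations from step 1.
3. Use the annotated data created from the output paraphrases as the training dataset for the NLU model.

In general, a paraphraser is a generative model and so cannot guarantee that slots/entities are preserved. Parrot's ability to generate high-quality paraphrases in a constrained fashion, without sacrificing intents and slots, is what makes it a good augmenter.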
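
The step of turning paraphrases back into annotated data can be sketched as follows. This is an illustration under stated assumptions, not Parrot's own code: `annotate_paraphrases` and the record layout are hypothetical, and a slot counts as preserved only if its original value appears verbatim (ignoring case) in the paraphrase.

```python
def annotate_paraphrases(paraphrases, intent, slots):
    """Hypothetical helper: re-label paraphrases that keep every slot value.

    slots maps a slot name to the text it had in the input utterance,
    for example {"cuisine": "italian", "city": "berlin"}.
    """
    annotated = []
    for text in paraphrases:
        # Assumes lowercasing does not change string length (true for ASCII).
        lowered = text.lower()
        spans = {}
        for name, value in slots.items():
            start = lowered.find(value.lower())
            if start == -1:
                break  # slot value lost in this paraphrase, so drop it
            spans[name] = (start, start + len(value))
        else:
            # Reached only when no slot was missing.
            annotated.append({"text": text, "intent": intent, "slots": spans})
    return annotated
```

Paraphrases that lose a slot value are discarded rather than repaired, which keeps the resulting training set consistent with the original annotation at the cost of some coverage.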

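🔧 Technical details

Intended scope

In the space of conversational engines, knowledge bots answer questions such as "When was the Berlin Wall torn down?", transactional bots carry out commands such as "Turn on your music, please", and voice assistants can do both. Parrot mainly focuses on augmenting text typed into or spoken to conversational interfaces, in order to build robust NLU models. (People usually do not type or speak long paragraphs to a conversational interface, so the pretrained model was trained on text samples with a maximum length of 32.)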
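
Given the length limit, it can help to screen out long inputs before augmenting them. The sketch below is an assumption-laden illustration: `short_enough` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for model tokens, which generally undercounts because a subword tokenizer can split one word into several tokens.

```python
MAX_LENGTH = 32  # the maximum length mentioned above


def short_enough(utterance: str, max_length: int = MAX_LENGTH) -> bool:
    """Hypothetical helper: rough length check using whitespace-separated words."""
    return len(utterance.split()) <= max_length


def split_by_length(utterances, max_length: int = MAX_LENGTH):
    """Separate utterances into those within the limit and those over it."""
    short = [u for u in utterances if short_enough(u, max_length)]
    long = [u for u in utterances if not short_enough(u, max_length)]
    return short, long
```

Utterances in the second list are better shortened or split before being passed to the paraphraser.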