bart-base-detox: open-source text detoxification model - rewrites toxic text as neutral language, free to use

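Bart Base Detox

Developed by s-nlp

A BART-based text detoxification model that rewrites toxic text as neutral language

Machine translation

Transformers

English #text-detoxification #parallel-data-training #toxic-content-rewriting

Downloads: 2,039

Released: 3/2/2022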

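Model overview

The model uses the BART architecture and was trained on the ParaDetox parallel detoxification dataset. It is built specifically for text detoxification: it rewrites text containing offensive or inappropriate language into a neutral form.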

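Model features

Parallel-data training

Trained on the ParaDetox parallel dataset, which pairs more than 10,000 toxic English sentences with neutral paraphrases

State-of-the-art performance

In the ParaDetox paper's evaluation, models trained on parallel data outperform state-of-the-art unsupervised detoxification models by a large margin

Multi-domain use

Applicable to text from social media, forum comments, and similar settings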

Model capabilities

Text detoxification

Text rewriting

Neutral paraphrase generation

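Use cases

Content moderation

Social media comment detoxification

Rewrite offensive social media comments once they have been flagged; the model rewrites text but does not itself classify it

Turn toxic comments into neutral wording while preserving the original meaning

Online community management

Forum post detoxification

Rewrite inappropriate forum posts automatically

Help keep community discussion civil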
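The moderation use cases above combine two steps: deciding whether a comment is toxic, and rewriting it if so. The following is a minimal sketch of that flow, not a description of how any particular platform does it. The names `ModeratedComment`, `moderate`, `keyword_is_toxic`, and `placeholder_rewrite` are hypothetical and introduced only for illustration; in practice `is_toxic` would be a separate toxicity classifier and `rewrite` would call bart-base-detox.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ModeratedComment:
    original: str    # text as submitted
    published: str   # text after moderation
    rewritten: bool  # True if the rewrite step was applied


def moderate(
    comments: Iterable[str],
    is_toxic: Callable[[str], bool],
    rewrite: Callable[[str], str],
) -> List[ModeratedComment]:
    """Rewrite only the comments that the detector flags; pass the rest through."""
    results = []
    for text in comments:
        if is_toxic(text):
            results.append(ModeratedComment(text, rewrite(text), True))
        else:
            results.append(ModeratedComment(text, text, False))
    return results


# Toy stand-ins (assumptions for this sketch only, not part of the model).
BLOCKLIST = {"idiotic"}


def keyword_is_toxic(text: str) -> bool:
    """Flag a comment if any word, stripped of punctuation, is on the blocklist."""
    return any(word.strip(".,!?").lower() in BLOCKLIST for word in text.split())


def placeholder_rewrite(text: str) -> str:
    """Mark the text instead of rewriting it; swap in a model call in real use."""
    return "[rewritten] " + text
```

Keeping detection and rewriting as separate callables means non-toxic comments are never passed through the generator, so they cannot be altered by it.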

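🚀 Detoxification model (bart-base-detox)

A text detoxification model: BART (base) fine-tuned on the parallel detoxification dataset ParaDetox.

🚀 Quick start

The model was introduced in the paper "ParaDetox: Detoxification with Parallel Data". It is a BART (base) model trained on the parallel detoxification dataset ParaDetox, and the paper reports that models trained on this parallel data outperform state-of-the-art unsupervised approaches on the detoxification task. Further details, code, and data are published alongside the paper.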

💻 Usage example

Basic usage

from transformers import BartForConditionalGeneration, AutoTokenizer

# The tokenizer comes from the base checkpoint; the fine-tuned weights come from s-nlp.
base_model_name = 'facebook/bart-base'
model_name = 's-nlp/bart-base-detox'
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Encode one toxic sentence, generate a single rewrite, and decode it back to text.
input_ids = tokenizer.encode('This is completely idiotic!', return_tensors='pt')
output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
# Example output given in the model card: This is unwise!
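The basic usage above handles one sentence at a time. For several sentences it is usually more convenient to pad them into batches. The sketch below is one way to do that under stated assumptions: the helper names `chunks`, `detoxify_batch`, and `load_detox_model`, the batch size of 8, and the use of beam search with `num_beams=4` are illustrative choices, not settings taken from the model card.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def detoxify_batch(texts, model, tokenizer, batch_size=8, max_length=50, num_beams=4):
    """Rewrite each sentence in `texts`, processing `batch_size` sentences per call."""
    rewritten = []
    for batch in chunks(texts, batch_size):
        # Pad to the longest sentence in the batch so all inputs fit in one tensor.
        encoded = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        # Passing the whole encoding forwards the attention mask along with the
        # input ids, so padded positions are ignored during generation.
        output_ids = model.generate(**encoded, max_length=max_length, num_beams=num_beams)
        rewritten.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return rewritten


def load_detox_model():
    """Load the model and tokenizer as in the basic usage (downloads the weights)."""
    from transformers import AutoTokenizer, BartForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("s-nlp/bart-base-detox")
    return model, tokenizer
```

To try it, call `model, tokenizer = load_detox_model()` and then `detoxify_batch(["This is completely idiotic!"], model, tokenizer)`. Beam search may produce rewrites that differ from the default decoding used in the basic usage.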

📚 Detailed documentation

Model information

Property	Details
Model type	BART (base)
Training data	s-nlp/paradetox
Base model	facebook/bart-base
License	OpenRAIL++

Citation

@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara  and
      Dementieva, Daryna  and
      Ustyantsev, Sergey  and
      Moskovskiy, Daniil  and
      Dale, David  and
      Krotova, Irina  and
      Semenov, Nikita  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}