toxic-prompt-roberta Open-Source Text Classification Model - Free Detection of Toxic Prompts and Responses in Conversations

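Toxic Prompt RoBERTa

Developed by Intel

A RoBERTa-based text classification model for detecting toxic prompts and responses in conversational systems

Text Classification

Transformers

License: MIT #ToxicityDetection #ConversationSafety #RoBERTaFineTuning

Downloads: 416

Released: 9/16/2024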

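Model Overview

Built on the RoBERTa architecture and fine-tuned on the ToxicChat and Jigsaw Unintended Bias datasets, this model identifies toxic content in conversations and can serve as a safety guardrail for AI systems.

Model Features

Fine-tuned on two datasets

Fine-tuned on both the ToxicChat and Jigsaw Unintended Bias datasets to improve detection accuracy

Ethical considerations

Fairness across demographic subgroups was considered during training to reduce classification bias

Efficient inference

Built on roberta-base (125M parameters), which makes it suitable for real-time detection

Model Capabilities

Toxic text detection

Conversation monitoring

Real-time content moderation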

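Use Cases

User experience monitoring

Real-time toxicity detection

Monitors conversations and detects toxic behavior from users

Can issue warnings or offer guidance on appropriate behavior

Content moderation

Automated moderation systems

Automatically removes toxic messages or mutes offending users in group chats

Helps maintain a healthy conversation environment

AI safety

Chatbot protection

Stops chatbots from responding to toxic inputs

Reduces the risk of AI systems being abused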

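🚀 Toxic Prompt RoBERTa Classification Model

Toxic Prompt RoBERTa 1.0 is a text classification model that can be used as a guardrail to protect conversational AI systems against toxic prompts and responses. The model is based on the RoBERTa architecture and was fine-tuned on the ToxicChat and Jigsaw Unintended Bias datasets.

🚀 Quick Start

You can use the model through the pipeline API with the following code: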

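from transformers import pipeline
# Load the classifier and its tokenizer from the Hugging Face Hub.
model_path = 'Intel/toxic-prompt-roberta'
pipe = pipeline('text-classification', model=model_path, tokenizer=model_path)
# Classify one prompt and print the predicted label with its score.
print(pipe('Create 20 paraphrases of I hate you'))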

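✨ Key Features

Targeted protection: detects toxic prompts and responses in conversational AI systems, helping keep exchanges safe for users.
Solid foundation: built on the RoBERTa architecture and its language-understanding capabilities.
Data-driven: fine-tuned on the ToxicChat and Jigsaw Unintended Bias datasets to improve performance and generalization.

📦 Installation

No specific installation steps are provided; refer to the relevant Hugging Face documentation to set up the model.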

💻 Usage Examples

Basic Usage

# Same pipeline call as in the quick start.
from transformers import pipeline
model_path = 'Intel/toxic-prompt-roberta'
pipe = pipeline('text-classification', model=model_path, tokenizer=model_path)
print(pipe('Create 20 paraphrases of I hate you'))

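📚 Documentation

Model Details

Model type: text classification model
Training data: the ToxicChat and Jigsaw Unintended Bias datasets
Fine-tuning environment: fine-tuned on a single Gaudi 2 card using the Gaudi Trainer from Optimum-Habana.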

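Input and Output Format

Input format: standard text input for RoBERTa sequence classification.
Output format: an (n, 2) array of logits, where n is the number of examples the user wants to run inference on. Each row of logits has the form [not toxic, toxic].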
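
To make the output format concrete, the sketch below turns one [not toxic, toxic] logit pair into a toxicity probability with a softmax. It is a minimal, dependency-free illustration rather than code from the model card: the function name `toxic_probability` and the sample logits are hypothetical, and the one-pair-per-example layout is an assumption.

```python
import math
from typing import Sequence


def toxic_probability(logits: Sequence[float]) -> float:
    """Softmax over a [not toxic, toxic] logit pair; returns P(toxic)."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [not toxic, toxic]")
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


# Hypothetical logits for two examples (assumed layout: one pair per example).
batch_logits = [[2.0, -1.0], [-0.5, 3.0]]
probs = [toxic_probability(pair) for pair in batch_logits]
print(probs)
```

A probability above a chosen threshold (0.5 is a common starting point, not a value given by the model card) can then be treated as a toxic classification.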

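Applicable Scenarios

User experience monitoring: monitor conversations in real time to detect toxic behavior from users. When a message is classified as toxic, the system can issue a warning or offer guidance on appropriate conduct.
Automated moderation: in group chats, automatically remove toxic messages or mute users who persist in toxic behavior.
Training and improvement: use the data gathered through toxicity detection to further train and improve toxicity classification models so that they handle complex interactions better.
Preventing chatbot abuse: stop chatbots from engaging with toxic inputs, which discourages abusive behavior.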
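
The warn-then-mute flow described in these scenarios could be sketched as follows. Everything here is illustrative: `ToxicityGuard`, `score_fn`, the 0.5 threshold, and the strike limit are hypothetical names and values chosen for the example, and a stand-in scorer replaces the real model so the sketch runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ToxicityGuard:
    """Gates messages on a toxicity score (hypothetical helper, not part of the model)."""
    score_fn: Callable[[str], float]  # returns P(toxic) in [0, 1]
    threshold: float = 0.5            # assumed cut-off; tune on your own data
    max_strikes: int = 3              # assumed toxic-message count that triggers a mute
    strikes: Dict[str, int] = field(default_factory=dict)

    def check(self, user_id: str, message: str) -> str:
        """Return 'allow', 'warn', or 'mute' for one incoming message."""
        if self.strikes.get(user_id, 0) >= self.max_strikes:
            return "mute"  # user has already reached the strike limit
        if self.score_fn(message) < self.threshold:
            return "allow"
        self.strikes[user_id] = self.strikes.get(user_id, 0) + 1
        return "mute" if self.strikes[user_id] >= self.max_strikes else "warn"


def fake_score(message: str) -> float:
    """Stand-in scorer so the sketch runs without downloading the model."""
    return 0.9 if "hate" in message.lower() else 0.1


guard = ToxicityGuard(score_fn=fake_score, max_strikes=2)
decisions = [
    guard.check("u1", "hello there"),
    guard.check("u1", "I hate you"),
    guard.check("u1", "I hate this"),
    guard.check("u1", "hello again"),
]
print(decisions)
```

In a real deployment, `score_fn` would wrap the classifier and return the probability of the toxic class; the label names the pipeline emits should be checked against the model's configuration rather than assumed.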

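Ethical Considerations

Risk: diversity disparity. When fine-tuning on the Jigsaw Unintended Bias dataset, adequate representation was ensured by following the distributions in the Jigsaw dataset. The dataset attempts to distribute toxicity labels evenly across subgroups.
Risk: harm to vulnerable groups. Some demographic groups are more likely to receive toxic and harmful comments. The Jigsaw Unintended Bias dataset attempts to reduce subgroup bias after fine-tuning by distributing toxic and non-toxic labels evenly across all demographic subgroups. The model was also tested to confirm that classification bias across subgroups is minimal.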

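🔧 Technical Details

Model Parameters

roberta-base (125M parameters) was fine-tuned with an added custom classification head to detect toxic inputs and outputs.

Performance Evaluation

Comparison with Other Models

The model was compared with Llama Guard 1 and 3 (LG1 and LG3) on the ToxicChat test dataset: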

Model	Parameters	Precision	Recall	F1	Area under the precision-recall curve (AUPRC)	Area under the ROC curve (AUROC)
LG1	6.74B	0.4806	0.7945	0.5989	0.626*	N/A
LG3	8.03B	0.5083	0.4730	0.4900	N/A	N/A
Toxic Prompt RoBERTa	125M	0.8315	0.7469	0.7869	0.855	0.971

* From the LG paper: https://arxiv.org/abs/2312.06674

Note that Llama Guard was not fine-tuned on ToxicChat. According to the LG1 paper, however, fine-tuning Llama Guard 1 on ToxicChat gave a reported AUPRC of about 0.81.
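
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the precision and recall columns; the helper name `f1_score` is chosen for this example, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "LG1": (0.4806, 0.7945, 0.5989),
    "LG3": (0.5083, 0.4730, 0.4900),
    "Toxic Prompt RoBERTa": (0.8315, 0.7469, 0.7869),
}

for name, (p, r, f1) in reported.items():
    # Print the recomputed value next to the reported one for comparison.
    print(name, round(f1_score(p, r), 4), f1)
```

The recomputed values agree with the reported F1 column to four decimal places.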

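Subgroup Bias Detection

Because the model was fine-tuned on the Jigsaw Unintended Bias dataset, its classifications on the Unintended Bias test set can be checked for subgroup bias. These metrics were computed with Intel/bias_auc: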

Metric	Female	Male	Christian	White	Muslim	Black	Homosexual (gay or lesbian)
AUROC	0.84937	0.80035	0.89867	0.76089	0.77137	0.74454	0.71766
BPSN	0.78805	0.82659	0.83746	0.78113	0.74067	0.82827	0.64330
BNSP	0.87421	0.80037	0.87614	0.81979	0.85586	0.76090	0.88065

* Only subgroups with at least 500 examples in the test dataset are shown.
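
For readers unfamiliar with the three rows, the sketch below shows how subgroup AUC, BPSN (Background Positive, Subgroup Negative) AUC, and BNSP (Background Negative, Subgroup Positive) AUC are commonly defined for the Jigsaw Unintended Bias benchmark. It is a from-scratch illustration under that assumption, not the Intel/bias_auc implementation, and the toy scores are invented.

```python
from typing import List, Tuple

# Each example: (model score, is_toxic label, mentions_subgroup flag).
Example = Tuple[float, bool, bool]


def auc(examples: List[Example]) -> float:
    """ROC AUC as the fraction of (toxic, non-toxic) pairs ranked correctly."""
    pos = [score for score, toxic, _ in examples if toxic]
    neg = [score for score, toxic, _ in examples if not toxic]
    if not pos or not neg:
        raise ValueError("need at least one toxic and one non-toxic example")
    # Ties count as half a correctly ranked pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auc(data: List[Example]) -> float:
    """AUC restricted to examples that mention the subgroup."""
    return auc([e for e in data if e[2]])


def bpsn_auc(data: List[Example]) -> float:
    """Toxic background examples versus non-toxic subgroup examples."""
    return auc([e for e in data if (e[1] and not e[2]) or (not e[1] and e[2])])


def bnsp_auc(data: List[Example]) -> float:
    """Non-toxic background examples versus toxic subgroup examples."""
    return auc([e for e in data if (not e[1] and not e[2]) or (e[1] and e[2])])


# Invented toy data: three subgroup examples and three background examples.
toy = [
    (0.9, True, True),
    (0.6, False, True),
    (0.2, False, True),
    (0.7, True, False),
    (0.5, True, False),
    (0.1, False, False),
]
print(subgroup_auc(toy), bpsn_auc(toy), bnsp_auc(toy))
```

Under these definitions, a low BPSN value suggests that non-toxic text mentioning the subgroup tends to be scored as more toxic than toxic text elsewhere, while a low BNSP value suggests that toxic text mentioning the subgroup tends to be under-scored.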

📄 License

This model is released under the MIT license.

Citation

@inproceedings {Wolf_Transformers_State-of-the-Art_Natural_2020, author = {Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Perric and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, Mariama and Lhoest, Quentin and Rush, Alexander M.}, month = oct, pages = {38--45}, publisher = {Association for Computational Linguistics}, title = {{Transformers: State-of-the-Art Natural Language Processing}}, url = {https://www.aclweb.org/anthology/2020.emnlp-demos.6}, year = {2020} }
@article {DBLP:journals/corr/abs-1907-11692, author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov}, title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach}, journal = {CoRR}, volume = {abs/1907.11692}, year = {2019}, url = {http://arxiv.org/abs/1907.11692}, archivePrefix = {arXiv}, eprint = {1907.11692}, timestamp = {Thu, 01 Aug 2019 08:59:33 +0200}, biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
@misc {jigsaw-unintended-bias-in-toxicity-classification, author = {cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, nithum}, title = {Jigsaw Unintended Bias in Toxicity Classification}, publisher = {Kaggle}, year = {2019}, url = {https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification} }
@misc {lin2023toxicchat, title={ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation}, author={Zi Lin and Zihan Wang and Yongqi Tong and Yangkun Wang and Yuxin Guo and Yujia Wang and Jingbo Shang}, year={2023}, eprint={2310.17389}, archivePrefix={arXiv}, primaryClass={cs.CL} }