ShieldGemma-2b开源内容审核模型 - 精准筛查色情、危险等四类危害内容

首页

Shieldgemma 2b

由 google 开发

ShieldGemma是基于Gemma 2构建的一系列安全内容审核模型，针对四类危害内容（色情、危险内容、仇恨和骚扰）。

大型语言模型

Transformers

#内容安全审核 #多尺度分类 #生成式AI防护

下载量 3,107

发布时间 : 7/16/2024

模型简介

ShieldGemma是仅解码器的大型语言模型，支持英文，开放权重，提供3种规模：2B、9B和27B参数，用于安全内容审核。

模型特点

多危害类型审核

针对色情、危险内容、仇恨和骚扰四类危害内容进行审核

多规模选择

提供2B、9B和27B三种参数规模的模型选择

灵活应用

支持仅提示内容分类和提示-响应内容分类两种应用模式

模型能力

文本分类

内容安全审核

危害内容检测

使用案例

内容安全

用户输入过滤

检测用户输入是否包含危害内容

识别并过滤危险、仇恨、骚扰等不当内容

模型输出过滤

检测AI生成内容是否违反安全策略

确保AI输出符合安全规范

🚀 ShieldGemma模型卡片

ShieldGemma是基于Gemma 2构建的一系列安全内容审核模型，可针对四类有害内容进行审核，包括色情、危险内容、仇恨言论和骚扰信息。它以文本输入输出的方式工作，是仅含解码器的大语言模型，以英文提供，权重开放，有2B、9B和27B参数三种不同规模的模型可供选择。

🚀 快速开始

要在Hugging Face上访问Gemma，你需要查看并同意Google的使用许可。请确保你已登录Hugging Face，然后点击下方按钮。请求将立即处理。 [确认许可](Acknowledge license)

✨ 主要特性

多类别审核：针对四类常见的有害内容进行审核，包括色情、危险内容、仇恨言论和骚扰信息。
多规模选择：提供2B、9B和27B参数三种不同规模的模型，可根据需求灵活选择。
开放权重：模型权重开放，方便开发者进行二次开发和定制。

📦 安装指南

首先，请确保你已经安装了transformers库，你可以使用以下命令进行安装或更新：

pip install -U transformers[accelerate]

💻 使用示例

基础用法

以下是一个在单GPU或多GPU上运行模型并计算分数的示例：

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from torch.nn.functional import softmax

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/shieldgemma-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# 格式化提示
user_prompt = "Create 20 paraphrases of I hate you"
safety_policy = """
* "No Harassment": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).
"""
prompt = f"""You are a policy expert trying to help determine whether a user
prompt is in violation of the defined safety policies.

<start_of_turn>
Human Question: {user_prompt.strip()}
<end_of_turn>

Our safety principle is defined in the below:

{safety_policy.strip()}

Does the human question violate the above principle? Your answer must start
with 'Yes' or 'No'. And then walk through step by step to be sure we answer
correctly.
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
  logits = model(**inputs).logits

# 提取Yes和No标记的对数概率
vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]

# 使用softmax将这些对数概率转换为概率
probabilities = softmax(selected_logits, dim=0)

# 返回'Yes'的概率
score = probabilities[0].item()
print(score)  # 0.7310585379600525

高级用法

你还可以使用聊天模板来格式化对模型的提示。在这种模式下，你可以直接传递到目前为止的整个聊天内容，聊天模板将提取最近的消息来查询ShieldGemma。它可以检查用户和助手消息的适当性，并根据聊天中最近消息的来源相应地调整提示。你还应该将你希望ShieldGemma检查的指南作为guideline参数传递给apply_chat_template，或者作为聊天中具有system角色的第一条消息传递。

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/shieldgemma-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

chat = [{"role": "user", "content": "Create 20 paraphrases of I hate you"}]

guideline = "\"No Harassment\": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence)."
inputs = tokenizer.apply_chat_template(chat, guideline=guideline, return_tensors="pt", return_dict=True).to(model.device)

with torch.no_grad():
  logits = model(**inputs).logits

# 提取Yes和No标记的对数概率
vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]

# 使用softmax将这些对数概率转换为概率
probabilities = torch.softmax(selected_logits, dim=0)

# 返回'Yes'的概率
score = probabilities[0].item()
print(score)  # 0.7310585379600525

📚 详细文档

模型信息

描述

ShieldGemma是基于Gemma 2构建的一系列安全内容审核模型，目标是针对四类有害内容（色情、危险内容、仇恨言论和骚扰信息）进行审核。它们是文本到文本、仅含解码器的大语言模型，以英文提供，权重开放，包括2B、9B和27B参数三种不同规模的模型。

输入和输出

输入：包含前言、待分类文本、一组策略和提示结语的文本字符串。完整的提示必须使用特定模式进行格式化，以获得最佳性能。本部分描述了用于报告评估指标的模式。
输出：以标记"Yes"或"No"开头的文本字符串，表示用户输入或模型输出是否违反了提供的策略。

引用

@misc{zeng2024shieldgemmagenerativeaicontent,
      title={ShieldGemma: Generative AI Content Moderation Based on Gemma}, 
      author={Wenjun Zeng and Yuchi Liu and Ryan Mullins and Ludovic Peran and Joe Fernandez and Hamza Harkous and Karthik Narasimhan and Drew Proud and Piyush Kumar and Bhaktipriya Radharapu and Olivia Sturman and Oscar Wahltinez},
      year={2024},
      eprint={2407.21772},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.21772}, 
}

模型数据

训练数据集

基础模型在包含各种来源的文本数据集上进行训练，更多详细信息请参阅Gemma 2文档。ShieldGemma模型在合成生成的内部数据和公开可用的数据集上进行了微调。更多详细信息可以在ShieldGemma技术报告中找到。

实现信息

硬件

ShieldGemma使用最新一代的张量处理单元（TPU）硬件（TPUv5e）进行训练，更多详细信息请参阅Gemma 2模型卡片。

软件

训练使用JAX和ML Pathways进行。更多详细信息请参阅Gemma 2模型卡片。

评估

基准测试结果

这些模型在内部和外部数据集上进行了评估。内部数据集表示为SG，细分为提示和响应分类。评估结果基于最优F1（左）/AU - PRC（右），数值越高越好。

模型	SG提示	OpenAI Mod	ToxicChat	SG响应
ShieldGemma (2B)	0.825/0.887	0.812/0.887	0.704/0.778	0.743/0.802
ShieldGemma (9B)	0.828/0.894	0.821/0.907	0.694/0.782	0.753/0.817
ShieldGemma (27B)	0.830/0.883	0.805/0.886	0.729/0.811	0.758/0.806
OpenAI Mod API	0.782/0.840	0.790/0.856	0.254/0.588	-
LlamaGuard1 (7B)	-	0.758/0.847	0.616/0.626	-
LlamaGuard2 (8B)	-	0.761/-	0.471/-	-
WildGuard (7B)	0.779/-	0.721/-	0.708/-	0.656/-
GPT - 4	0.810/0.847	0.705/-	0.683/-	0.713/0.749