Open-source phi3-hallucination-judge-merge model: detecting hallucinations in language model output


Phi3 Hallucination Judge Merge

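Developed by grounded-ai

This model detects hallucinations in language model output, that is, responses that are coherent but factually incorrect or not grounded in the provided context.

Large Language Model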

Transformers

License: MIT #hallucination-detection #binary-classification #PEFT-fine-tuning

Downloads: 63

Released: 4/25/2025

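Model Overview

A binary classification model specialized in detecting hallucinations in language model output, fine-tuned for this task.

Model Features

High-performance hallucination detection

Reaches an F1 score of 0.81 on the hallucination detection task, ahead of several of the language models it is compared against.

Lightweight adapter

Uses a PEFT adapter, so fine-tuning is efficient and leaves the base model unmodified.

Standardized prompt strategy

Provides a standardized input format and prompt strategy, which makes it quick to integrate into existing systems.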

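Model Capabilities

Hallucination detection

Text classification

Language model output evaluation

Use Cases

Language model quality evaluation

Model output validation

Verify the factual accuracy of content generated by a language model

Precision of 0.85 on class 0 in the reported evaluation

Content moderation

Fact checking

Automatically detect factual errors in generated content

Recall of 0.87 on class 1 in the reported evaluation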

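🚀 Hallucination Detection PEFT Adapter Model

This repository contains our PEFT adapter model for hallucination evaluation. The model detects hallucinations in language model output, framed as a binary classification task: does the output contain content that is factually wrong or nonsensical?

🚀 Quick Start

Hallucination Detection Metrics

Our merged model achieved the following performance on the binary classification task of detecting hallucinations in language model output: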

              precision    recall  f1-score   support

           0       0.85      0.71      0.77       100
           1       0.75      0.87      0.81       100

    accuracy                           0.79       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.79      0.79       200
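
The rows of this report are tied together by the usual definitions of precision, recall and F1. As a rough cross-check, the sketch below rebuilds per-class counts from recall times support and recomputes the figures. The counts are inferred from the report (the underlying predictions are not published on this page), and the helper name `prf` is illustrative only.

```python
# Counts inferred from the report: recall x support per class. This is an
# assumption; the raw predictions are not available on this page.
support = {0: 100, 1: 100}
recall_reported = {0: 0.71, 1: 0.87}

# correct[c]: samples of class c that were predicted as c
correct = {c: round(recall_reported[c] * support[c]) for c in support}
# missed[c]: samples of class c that were predicted as the other class
missed = {c: support[c] - correct[c] for c in support}


def prf(cls):
    """Return (precision, recall, f1) for one class of the binary task."""
    other = 1 - cls
    tp = correct[cls]
    fp = missed[other]  # other-class samples that were predicted as cls
    fn = missed[cls]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = sum(correct.values()) / sum(support.values())
report = {c: tuple(round(v, 2) for v in prf(c)) for c in support}
macro_f1 = round(sum(prf(c)[2] for c in support) / len(support), 2)

print(report, accuracy, macro_f1)
```

Under these inferred counts the recomputed values agree with the report to two decimals, which suggests the table is internally consistent.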

Model Usage

For best results, we recommend starting with the following prompt strategy (and encourage you to adjust it as needed):

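import torch
from transformers import AutoTokenizer, pipeline

# Assumed Hub ID, inferred from the developer and model name on this page.
base_model = "grounded-ai/phi3-hallucination-judge-merge"
# Assumed setting; "flash_attention_2" is an option where it is installed.
attn_implementation = "eager"
tokenizer = AutoTokenizer.from_pretrained(base_model)


def format_input(reference, query, response):
    """Build the judge prompt from the grounding context, user input and model response."""
    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
    A hallucination occurs when the response is coherent but factually incorrect or nonsensical
    outputs that are not grounded in the provided context.
    You are given the following information:
    ####INFO####
    [Knowledge]: {reference}
    [User Input]: {query}
    [Model Response]: {response}
    ####END INFO####
    Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
    """
    return prompt


# `reference` carries the grounding context that the response is judged against.
text = format_input(
    reference='Walrus are the largest mammal',
    query='Based on the following <context>Walrus are the largest mammal</context> answer the question <query> What is the best PC?</query>',
    response='The best PC is the mac',
)

messages = [
    {"role": "user", "content": text}
]

pipe = pipeline(
    "text-generation",
    model=base_model,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
    tokenizer=tokenizer,
)
# Two new tokens are enough for a "yes" / "no" verdict.
generation_args = {
    "max_new_tokens": 2,
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": True,
}

output = pipe(messages, **generation_args)
print(f'Hallucination: {output[0]["generated_text"].strip().lower()}')
# Hallucination: yes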
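
When the judge is wired into a larger system, it can help to turn the raw generation into a typed verdict rather than comparing strings inline. The following is a minimal sketch, not part of the model's published code: `parse_verdict` is a hypothetical helper, and the sample output only mimics the list-of-dicts shape used in the snippet above.

```python
def parse_verdict(generated_text):
    """Map the judge's raw generation to True (hallucination), False, or None."""
    cleaned = generated_text.strip().lower()
    # Keep only the leading alphabetic run so that "Yes." or "no\n" still parse.
    word = ""
    for ch in cleaned:
        if ch.isalpha():
            word += ch
        else:
            break
    if word == "yes":
        return True
    if word == "no":
        return False
    return None  # unexpected output; the caller decides how to handle it


# Hand-written sample in the same shape as the pipeline output (not a real generation).
example_output = [{"generated_text": " Yes"}]
print(parse_verdict(example_output[0]["generated_text"]))
```

Returning `None` for anything other than "yes" or "no" keeps malformed generations visible instead of silently counting them as one of the two classes.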

Comparison with Other Models

We compared the merged model's performance on the hallucination detection benchmark with that of several other state-of-the-art language models:

Model	Precision	Recall	F1 score
Our merged model	0.75	0.87	0.81
GPT-4	0.93	0.72	0.82
GPT-4 Turbo	0.97	0.70	0.81
Gemini Pro	0.89	0.53	0.67
GPT-3.5	0.89	0.65	0.75
GPT-3.5-turbo-instruct	0.89	0.80	0.84
Palm 2 (Text Bison)	1.00	0.44	0.61
Claude V2	0.80	0.95	0.87

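As the table shows, our merged model reaches an F1 score of 0.81 on the hallucination detection task. That is ahead of Gemini Pro, GPT-3.5 and Palm 2 (Text Bison), level with GPT-4 Turbo, and behind GPT-4, GPT-3.5-turbo-instruct and Claude V2.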
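
To see where the merged model sits among the compared systems, the table can be sorted by F1. The sketch below only re-uses the numbers reported in the table; the variable names are illustrative.

```python
# (precision, recall, F1) exactly as reported in the comparison table.
scores = {
    "Merged model": (0.75, 0.87, 0.81),
    "GPT-4": (0.93, 0.72, 0.82),
    "GPT-4 Turbo": (0.97, 0.70, 0.81),
    "Gemini Pro": (0.89, 0.53, 0.67),
    "GPT-3.5": (0.89, 0.65, 0.75),
    "GPT-3.5-turbo-instruct": (0.89, 0.80, 0.84),
    "Palm 2 (Text Bison)": (1.00, 0.44, 0.61),
    "Claude V2": (0.80, 0.95, 0.87),
}

# Rank every system by its reported F1, highest first.
ranked = sorted(scores.items(), key=lambda item: item[1][2], reverse=True)

ours = scores["Merged model"][2]
above = [name for name, (_, _, f1) in scores.items() if f1 > ours]
tied = [name for name, (_, _, f1) in scores.items()
        if f1 == ours and name != "Merged model"]
below = [name for name, (_, _, f1) in scores.items() if f1 < ours]

for name, (precision, recall, f1) in ranked:
    print(f"{name:<24} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print("above:", above, "tied:", tied, "below:", below)
```

The same table also shows a trade-off: the merged model has the lowest precision of the group but the second-highest recall.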

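We will continue to improve and fine-tune the merged model to achieve better performance across a range of benchmarks and tasks.

Citation

Scores are from arize/phoenix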

📚 Detailed Documentation

Training Data

The training data for this model comes from the following work:

@misc{HaluEval,
  author = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
  title = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
  year = {2023},
  journal = {arXiv preprint arXiv:2305.11747},
  url = {https://arxiv.org/abs/2305.11747}
}