🚀 CompassJudger-2
CompassJudger-2 is a series of new generalist judge models designed to overcome the narrow specialization and limited robustness of existing LLM-as-judge solutions. It adopts a powerful new training paradigm that addresses the difficulties current judge models face in comprehensive evaluation.
✨ Key Features
- Advanced data strategy: task-driven, multi-domain data curation and synthesis strategies that improve the model's robustness and domain adaptability.
- Verifiable-reward-guided training: judging tasks are supervised with verifiable rewards, and the model's intrinsic reasoning is guided through chain-of-thought (CoT) and rejection sampling. A refined margin policy-gradient loss further boosts performance.
- Superior performance: state-of-the-art results across multiple judge and reward benchmarks; the 7B model achieves accuracy competitive with much larger models.
- JudgerBenchV2: a new comprehensive benchmark covering 10,000 questions across 10 scenarios, using Mixture-of-Judgers (MoJ) consensus to obtain more reliable ground-truth labels.
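The MoJ consensus used to label JudgerBenchV2 can be illustrated by a simple majority vote over independent judge verdicts. This is a hypothetical sketch, not the paper's exact aggregation procedure; the `moj_consensus` helper and its labels are illustrative:

```python
from collections import Counter

def moj_consensus(verdicts):
    """Aggregate independent judge verdicts into a consensus label by majority vote.

    verdicts: a list of labels (e.g. "Model A" / "Model B"), one per judge.
    Returns the most frequent label.
    """
    counts = Counter(verdicts)
    label, _ = counts.most_common(1)[0]
    return label

# Three hypothetical judges vote on the same comparison pair
print(moj_consensus(["Model A", "Model B", "Model A"]))  # → Model A
```

With more judges voting on each question, labels that survive the majority vote are more reliable than any single judge's verdict.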
📦 Installation
The source documentation does not describe installation steps, so this section is omitted.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-2-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Judge prompt template; fill in {question}, {answerA}, and {answerB} before use
prompt = """
Please act as an impartial judge to evaluate the responses provided by two AI assistants to the user question below. Your evaluation should focus on the following criteria: helpfulness, relevance, accuracy, depth, creativity, and level of detail.
- Do not let the order of presentation, response length, or assistant names influence your judgment.
- Base your decision solely on how well each response addresses the user's question and adheres to the instructions.
Your final reply must be structured in the following format:
{
"Choice": "[Model A or Model B]"
}

User Question: {question}

Model A's Response: {answerA}

Model B's Response: {answerB}

Now it's your turn. Please provide selection result as required:
"""

messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
# Strip the prompt tokens so only the newly generated reply is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
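Since the prompt instructs the judge to end its reply with a JSON object containing a "Choice" field, the verdict can be pulled out of the raw text with a small parser. This is a minimal sketch assuming the model follows the requested format; the `extract_choice` helper is illustrative:

```python
import json
import re

def extract_choice(reply):
    """Find the last {...} JSON object in the judge's reply and read its "Choice" field."""
    matches = re.findall(r"\{[^{}]*\}", reply)
    if not matches:
        return None
    try:
        return json.loads(matches[-1]).get("Choice")
    except json.JSONDecodeError:
        return None

sample = 'After comparing both answers... {"Choice": "Model A"}'
print(extract_choice(sample))  # → Model A
```

Returning `None` on malformed output lets a calling script detect and retry judgments where the model drifted from the required format.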
📚 Documentation
Model Information
| Model Name | Size | Base Model | Download | Notes |
| --- | --- | --- | --- | --- |
| 👉 CompassJudger-2-7B-Instruct | 7B | Qwen2.5-7B-Instruct | 🤗 Model | Fine-tuned for general-purpose judging. |
| 👉 CompassJudger-2-32B-Instruct | 32B | Qwen2.5-32B-Instruct | 🤗 Model | A larger, more capable judge model. |
Evaluation Results
CompassJudger-2 sets a new standard for judge models, outperforming general-purpose models, reward models, and other specialized judge models across a wide range of benchmarks.
| Model | JudgerBench V2 | JudgeBench | RMB | RewardBench | Average |
| --- | --- | --- | --- | --- | --- |
| 7B Judge Models | | | | | |
| CompassJudger-1-7B-Instruct | 57.96 | 46.00 | 38.18 | 80.74 | 55.72 |
| Con-J-7B-Instruct | 52.35 | 38.06 | 71.50 | 87.10 | 62.25 |
| RISE-Judge-Qwen2.5-7B | 46.12 | 40.48 | 72.64 | 88.20 | 61.61 |
| CompassJudger-2-7B-Instruct | 60.52 | 63.06 | 73.90 | 90.96 | 72.11 |
| 32B+ Judge Models | | | | | |
| CompassJudger-1-32B-Instruct | 60.33 | 62.29 | 77.63 | 86.17 | 71.61 |
| Skywork-Critic-Llama-3.1-70B | 52.41 | 50.65 | 65.50 | 93.30 | 65.47 |
| RISE-Judge-Qwen2.5-32B | 56.42 | 63.87 | 73.70 | 92.70 | 71.67 |
| CompassJudger-2-32B-Instruct | 62.21 | 65.48 | 72.98 | 92.62 | 73.32 |
| General Models (for reference) | | | | | |
| Qwen2.5-32B-Instruct | 62.97 | 59.84 | 74.99 | 85.61 | 70.85 |
| DeepSeek-V3-0324 | 64.43 | 59.68 | 78.16 | 85.17 | 71.86 |
| Qwen3-235B-A22B | 61.40 | 65.97 | 75.59 | 84.68 | 71.91 |
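The Average column is the arithmetic mean of the four benchmark scores. For instance, the reported average for CompassJudger-2-7B-Instruct can be reproduced from its row in the table:

```python
# Per-benchmark scores for CompassJudger-2-7B-Instruct, taken from the table above
scores = {
    "JudgerBench V2": 60.52,
    "JudgeBench": 63.06,
    "RMB": 73.90,
    "RewardBench": 90.96,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # → 72.11
```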
For detailed benchmark performance and methodology, please refer to our 📑 paper.
🔧 Technical Details
The source documentation does not provide further technical details, so this section is omitted.
📄 License
This project is released under the Apache 2.0 license. See the LICENSE file for details.
Citation
If you find our work helpful, please consider citing our paper:
```bibtex
@article{zhang2025compassjudger,
  title={CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards},
  author={Zhang, Taolin and Cao, Maosong and Lam, Alexander and Zhang, Songyang and Chen, Kai},
  journal={arXiv preprint arXiv:2507.09104},
  year={2025}
}
```