🚀 CompassJudger-2
CompassJudger-2 is a series of new generalist judge models designed to overcome the narrow specialization and limited robustness of existing LLM-as-judge solutions. It adopts a powerful new training paradigm that addresses the difficulties current judge models face in comprehensive evaluation.
✨ Key Features
- Advanced data strategy: task-driven, multi-domain data curation and synthesis strategies that improve the model's robustness and domain adaptability.
- Verifiable-reward-guided training: judging tasks are supervised with verifiable rewards, and the model's intrinsic reasoning is guided through chain-of-thought (CoT) and rejection sampling. A refined margin policy-gradient loss further boosts performance.
- Superior performance: state-of-the-art results across multiple judge and reward benchmarks; the 7B model achieves accuracy competitive with much larger models.
- JudgerBenchV2: a new comprehensive benchmark covering 10,000 questions across 10 scenarios, using Mixture-of-Judgers (MoJ) consensus to obtain more reliable ground-truth labels.
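The MoJ consensus used to label JudgerBenchV2 can be illustrated by a simple majority vote over independent judge verdicts. This is a hypothetical sketch, not the paper's exact aggregation procedure; the `moj_consensus` helper and its labels are illustrative:

```python
from collections import Counter

def moj_consensus(verdicts):
    """Aggregate independent judge verdicts into a consensus label by majority vote.

    verdicts: a list of labels (e.g. "Model A" / "Model B"), one per judge.
    Returns the most frequent label.
    """
    counts = Counter(verdicts)
    label, _ = counts.most_common(1)[0]
    return label

# Three hypothetical judges vote on the same comparison pair
print(moj_consensus(["Model A", "Model B", "Model A"]))  # → Model A
```

With more judges voting on each question, labels that survive the majority vote are more reliable than any single judge's verdict.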
📦 Installation
The source documentation does not describe installation steps, so this section is omitted.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-2-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Judge prompt template; fill in {question}, {answerA}, and {answerB} before use
prompt = """
Please act as an impartial judge to evaluate the responses provided by two AI assistants to the user question below. Your evaluation should focus on the following criteria: helpfulness, relevance, accuracy, depth, creativity, and level of detail.
- Do not let the order of presentation, response length, or assistant names influence your judgment.
- Base your decision solely on how well each response addresses the user's question and adheres to the instructions.
Your final reply must be structured in the following format:
{
"Choice": "[Model A or Model B]"
}

User Question: {question}

Model A's Response: {answerA}

Model B's Response: {answerB}

Now it's your turn. Please provide selection result as required:
"""

messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
# Strip the prompt tokens so only the newly generated reply is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
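Since the prompt instructs the judge to end its reply with a JSON object containing a "Choice" field, the verdict can be pulled out of the raw text with a small parser. This is a minimal sketch assuming the model follows the requested format; the `extract_choice` helper is illustrative:

```python
import json
import re

def extract_choice(reply):
    """Find the last {...} JSON object in the judge's reply and read its "Choice" field."""
    matches = re.findall(r"\{[^{}]*\}", reply)
    if not matches:
        return None
    try:
        return json.loads(matches[-1]).get("Choice")
    except json.JSONDecodeError:
        return None

sample = 'After comparing both answers... {"Choice": "Model A"}'
print(extract_choice(sample))  # → Model A
```

Returning `None` on malformed output lets a calling script detect and retry judgments where the model drifted from the required format.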
📚 Documentation
Model Information
| Model Name | Size | Base Model | Download | Notes |
| --- | --- | --- | --- | --- |
| 👉 CompassJudger-2-7B-Instruct | 7B | Qwen2.5-7B-Instruct | 🤗 Model | Fine-tuned for general-purpose judging. |
| 👉 CompassJudger-2-32B-Instruct | 32B | Qwen2.5-32B-Instruct | 🤗 Model | A larger, more capable judge model. |
Evaluation Results
CompassJudger-2 sets a new standard for judge models, outperforming general-purpose models, reward models, and other specialized judge models across a wide range of benchmarks.
| Model | JudgerBench V2 | JudgeBench | RMB | RewardBench | Average |
| --- | --- | --- | --- | --- | --- |
| 7B Judge Models | | | | | |
| CompassJudger-1-7B-Instruct | 57.96 | 46.00 | 38.18 | 80.74 | 55.72 |
| Con-J-7B-Instruct | 52.35 | 38.06 | 71.50 | 87.10 | 62.25 |
| RISE-Judge-Qwen2.5-7B | 46.12 | 40.48 | 72.64 | 88.20 | 61.61 |
| CompassJudger-2-7B-Instruct | 60.52 | 63.06 | 73.90 | 90.96 | 72.11 |
| 32B+ Judge Models | | | | | |
| CompassJudger-1-32B-Instruct | 60.33 | 62.29 | 77.63 | 86.17 | 71.61 |
| Skywork-Critic-Llama-3.1-70B | 52.41 | 50.65 | 65.50 | 93.30 | 65.47 |
| RISE-Judge-Qwen2.5-32B | 56.42 | 63.87 | 73.70 | 92.70 | 71.67 |
| CompassJudger-2-32B-Instruct | 62.21 | 65.48 | 72.98 | 92.62 | 73.32 |
| General Models (for reference) | | | | | |
| Qwen2.5-32B-Instruct | 62.97 | 59.84 | 74.99 | 85.61 | 70.85 |
| DeepSeek-V3-0324 | 64.43 | 59.68 | 78.16 | 85.17 | 71.86 |
| Qwen3-235B-A22B | 61.40 | 65.97 | 75.59 | 84.68 | 71.91 |
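The Average column is the arithmetic mean of the four benchmark scores. For instance, the reported average for CompassJudger-2-7B-Instruct can be reproduced from its row in the table:

```python
# Per-benchmark scores for CompassJudger-2-7B-Instruct, taken from the table above
scores = {
    "JudgerBench V2": 60.52,
    "JudgeBench": 63.06,
    "RMB": 73.90,
    "RewardBench": 90.96,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # → 72.11
```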
For detailed benchmark performance and methodology, please refer to our 📑 paper.
🔧 Technical Details
The source documentation does not provide further technical details, so this section is omitted.
📄 License
This project is released under the Apache 2.0 license. See the LICENSE file for details.
Citation
If you find our work helpful, please consider citing our paper:
```bibtex
@article{zhang2025compassjudger,
  title={CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards},
  author={Zhang, Taolin and Cao, Maosong and Lam, Alexander and Zhang, Songyang and Chen, Kai},
  journal={arXiv preprint arXiv:2507.09104},
  year={2025}
}
```