Phi-4-reasoning-plus-GGUF: a free, open-source reasoning model for math, science, and coding

Phi 4 Reasoning Plus GGUF

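GGUF conversion by unsloth

Phi-4-reasoning-plus is an open-source reasoning model developed by Microsoft Research, focused on advanced reasoning in math, science, and coding.

Large language model | Multilingual support | License: MIT | #math-reasoning-optimized #RL-enhanced #science-and-coding

Downloads: 109.62k

Release date: 5/1/2025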

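Model introduction

Phi-4-reasoning-plus is an advanced reasoning model built on Phi-4. It was trained with supervised fine-tuning on a dataset of chain-of-thought traces and with reinforcement learning, and it focuses on math, science, and coding skills.

Model features

Advanced reasoning

Targets advanced reasoning tasks in math, science, and coding, optimized through supervised fine-tuning and reinforcement learning.

Long context support

Supports a context length of up to 32k tokens, which suits complex tasks.

High performance

Scores well on several reasoning benchmarks and is ahead of some larger open-weight models on math, for example 78.0 on AIME 25 versus 65.8 for QwQ 32B.

Safety alignment

Safety post-training is intended to keep the model's behavior within safety and ethics guidelines.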

Model capabilities

Math problem solving

Science question answering

Coding problem solving

Chain-of-thought reasoning

Text generation

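Use cases

Education

Math olympiad problem solving

Solves difficult math olympiad problems, such as those in AIME and OmniMath.

Reaches 78.0% accuracy on AIME 2025.

Graduate-level science question answering

Answers complex, graduate-level science questions, such as those in GPQA-Diamond.

Reaches 68.9% accuracy on GPQA-D.

Coding

Competitive programming problem solving

Solves code generation problems drawn from programming competitions, such as those in LiveCodeBench.

Reaches 53.1% accuracy on LiveCodeBench.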

🚀 Phi-4-reasoning-plus

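Phi-4-reasoning-plus is an advanced reasoning model finetuned from Phi-4 with supervised fine-tuning and reinforcement learning. It performs well on reasoning tasks in math, science, and coding, and is aimed at memory- or compute-constrained environments and latency-bound scenarios.

🚀 Quick start

You can run Phi-4-reasoning-plus with the transformers library, as in this example: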

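from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model weights from the Hugging Face Hub.
# device_map="auto" relies on the accelerate package to place the weights.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning-plus")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-reasoning-plus", device_map="auto", torch_dtype="auto")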

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
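# Apply the chat template and tokenize the conversation.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample with the recommended settings.
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(outputs[0]))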
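The system prompt above asks the model to put its reasoning inside <think> ... </think> and the final answer after it. The following is a minimal sketch of separating those two parts in a decoded response. The helper name split_reasoning is hypothetical, and the sketch assumes the text contains at most one complete <think> block.

```python
import re

# Matches the first <think> ... </think> block, including across newlines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (thought, solution) from a decoded model response.

    Hypothetical helper: assumes the "<think> ... </think> solution" layout
    requested by the system prompt. If no complete block is found, the whole
    text is treated as the solution.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    thought = match.group(1).strip()
    solution = text[match.end():].strip()
    return thought, solution


if __name__ == "__main__":
    sample = "<think>Power rule: d/dx x^n = n x^(n-1).</think>The derivative is 2x."
    thought, solution = split_reasoning(sample)
    print(thought)
    print(solution)
```

When applying this to the quick-start output, decode only the newly generated tokens (for example outputs[0][inputs.shape[-1]:]), because the prompt itself contains a literal <think> example that the pattern would match first. End-of-turn tokens may still trail the solution text.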

You can also serve the model with vllm:

vllm serve microsoft/Phi-4-reasoning-plus --enable-reasoning --reasoning-parser deepseek_r1
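Once the server is running, it can be queried over HTTP. The sketch below builds an OpenAI-style chat completion request using only the standard library. Several details are assumptions rather than facts from this page: the address http://localhost:8000 (vLLM's usual default), the /v1/chat/completions route, and the reasoning_content field that vLLM's reasoning parsers have used for the separated chain of thought. Check them against your vLLM version.

```python
import json
import urllib.request

# Assumption: default local address of the server started by `vllm serve`.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(question: str, max_tokens: int = 4096) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "microsoft/Phi-4-reasoning-plus",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def ask(question: str) -> dict:
    """POST the request and return the first choice's message as a dict."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]


if __name__ == "__main__":
    # Only prints the request body, so this runs without a server.
    print(json.dumps(build_payload("What is the derivative of x^2?"), indent=2))
```

With the server up, message = ask("What is the derivative of x^2?") returns the reply; message.get("reasoning_content") and message.get("content") would then hold the reasoning and the answer if the field names match your version.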

⚠️ Important

You must pass --jinja to llama.cpp to enable reasoning. Without it, the <think> token is not emitted.
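As an illustration, the sketch below assembles a llama-cli command line that includes --jinja together with the sampling settings recommended on this page. The GGUF file name is hypothetical (use whichever quantization you downloaded), and the helper name llama_cli_command is made up for this example.

```python
import shlex


def llama_cli_command(model_path: str, prompt: str, ctx_size: int = 32768) -> list[str]:
    """Build an argument list for llama.cpp's llama-cli with reasoning enabled."""
    return [
        "llama-cli",
        "-m", model_path,      # path to the GGUF file
        "--jinja",             # use the model's chat template, needed for <think>
        "--temp", "0.8",
        "--top-p", "0.95",
        "-c", str(ctx_size),   # context size in tokens
        "-p", prompt,
    ]


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    cmd = llama_cli_command("Phi-4-reasoning-plus-Q4_K_M.gguf", "What is the derivative of x^2?")
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run(cmd, check=True) if llama-cli is on the PATH.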

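💡 Usage tips

For inference, temperature=0.8, top_p=0.95, and do_sample=True tend to give better results. For more complex queries, set the maximum number of tokens to 32k to allow for a longer chain of thought (CoT).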
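Because the prompt and the generated tokens share the same context window, the usable generation budget shrinks as the prompt grows. A minimal sketch of that arithmetic follows. It assumes "32k" means 32,768 tokens, and the names generation_budget and SAMPLING are hypothetical.

```python
# Assumption: "32k" is read as 32 * 1024 = 32,768 tokens.
CONTEXT_LENGTH = 32 * 1024

# Sampling settings recommended on this page.
SAMPLING = {"temperature": 0.8, "top_p": 0.95, "do_sample": True}


def generation_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Largest max_new_tokens that keeps prompt plus output inside the context."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_length - prompt_tokens, 0)


if __name__ == "__main__":
    budget = generation_budget(prompt_tokens=300)
    print({**SAMPLING, "max_new_tokens": budget})
```

In the quick-start code, prompt_tokens would be inputs.shape[-1], and the resulting dictionary could be unpacked into model.generate(inputs.to(model.device), **kwargs).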

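✨ Key features

Advanced reasoning: trained with supervised fine-tuning and reinforcement learning, it performs well on reasoning tasks in math, science, and coding.
Efficient architecture: a 14B-parameter dense decoder-only Transformer, suited to memory- and compute-constrained environments.
Long context: supports a 32k-token context length for complex inputs.
Broad applicability: suited to latency-bound scenarios and to general-purpose AI systems and applications that need reasoning and logic.

📦 Installation

The source documentation gives no specific installation steps. Use the code examples in the quick start section above as a reference.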

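📚 Detailed documentation

Model overview

Attribute	Details
Developer	Microsoft Research
Description	Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 with supervised fine-tuning and reinforcement learning. The supervised fine-tuning dataset includes synthetic prompts and high-quality filtered data from public domain websites, focused on math, science, and coding skills, as well as alignment data for safety and Responsible AI. The model was additionally trained with reinforcement learning, so it has higher accuracy but generates on average 50% more tokens, which means higher latency.
Architecture	Same base model as the previously released Phi-4: a 14B-parameter dense decoder-only Transformer
Inputs	Text, best suited to prompts in chat format
Context length	32k tokens
GPUs	32 H100-80G
Training time	2.5 days
Training data	16B tokens, about 8.3B unique tokens
Outputs	Generated text in response to the input. Responses have two parts: a chain-of-thought reasoning block followed by a summarization block
Dates	January 2025 to April 2025
Status	Static model trained on an offline dataset, with a cutoff of March 2025 and earlier for publicly available data
Release date	April 30, 2025
License	MIT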
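The training figures in the table can be turned into two rough derived quantities: total GPU-hours and the average number of times each unique token was seen. This is a back-of-the-envelope sketch. It assumes all 32 GPUs were in use for the full 2.5 days and that the token counts and the training time describe the same training stage, neither of which the table states.

```python
# Figures copied from the model overview table.
GPUS = 32
TRAINING_DAYS = 2.5
TOTAL_TOKENS = 16e9
UNIQUE_TOKENS = 8.3e9


def gpu_hours(gpus: int, days: float) -> float:
    """GPU-hours, assuming every GPU is busy for the whole period."""
    return gpus * days * 24


def average_passes(total_tokens: float, unique_tokens: float) -> float:
    """Average number of times each unique token appears in training."""
    return total_tokens / unique_tokens


if __name__ == "__main__":
    print(gpu_hours(GPUS, TRAINING_DAYS))                          # 1920.0
    print(round(average_passes(TOTAL_TOKENS, UNIQUE_TOKENS), 2))   # 1.93
```

Under these assumptions, training used about 1,920 H100-hours and the data was seen a little under twice on average.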

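Intended use

Use type	Details
Primary use cases	The model is designed to accelerate research on language models and to serve as a building block for generative AI features. It targets general-purpose AI systems and applications (primarily in English) that involve memory- or compute-constrained environments, latency-bound scenarios, and reasoning and logic.
Out-of-scope use cases	The model is designed and tested for math reasoning only. It is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should also comply with applicable laws and regulations (including privacy and trade compliance laws).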

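Data overview

Training datasets

The training data is a mixture of Q&A and chat-format data in math, science, and coding. The chat prompts come from filtered high-quality web data and are optionally rewritten and processed through a synthetic data generation pipeline. Data that improves truthfulness and safety is also included.

Benchmark datasets

Phi-4-reasoning-plus was evaluated with the open-source Eureka evaluation suite and internal benchmarks. The evaluation tasks are:

Reasoning tasks: AIME 2025, 2024, 2023, and 2022 (math olympiad questions), GPQA-Diamond (complex science questions), OmniMath (olympiad-level math problem collection), LiveCodeBench (code generation benchmark), 3SAT and TSP (algorithmic problem solving), BA Calendar (planning), Maze and SpatialMap (spatial understanding).
General-purpose benchmarks: Kitab (information retrieval), IFEval and ArenaHard (instruction following), PhiBench (internal benchmark), FlenQA (impact of prompt length on model performance), HumanEvalPlus (functional code generation), MMLU-Pro (aggregated multitask language understanding dataset).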

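Safety

Approach

Phi-4-reasoning-plus adopts a robust safety post-training approach based on supervised fine-tuning (SFT). It draws on a variety of open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines.

Safety evaluation and red-teaming

Before release, Phi-4-reasoning-plus went through a multi-faceted evaluation. Quantitative evaluation used multiple open-source safety benchmarks and in-house tools that rely on adversarial conversation simulation. Qualitative safety evaluation was carried out with Microsoft's independent AI Red Team (AIRT) to assess the safety risks the model poses in both average and adversarial user scenarios. The model was also evaluated on the Toxigen benchmark, which is designed to measure bias and toxicity targeted at minority groups.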

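Model quality

An overview of model quality on representative benchmarks. In the tables below, higher numbers indicate better performance: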

Model	AIME 24	AIME 25	OmniMath	GPQA-D	LiveCodeBench (8/1/24–2/1/25)
Phi-4-reasoning	75.3	62.9	76.6	65.8	53.8
Phi-4-reasoning-plus	81.3	78.0	81.9	68.9	53.1
OpenThinker2-32B	58.0	58.0	—	64.1	—
QwQ 32B	79.5	65.8	—	59.5	63.4
EXAONE-Deep-32B	72.1	65.8	—	66.1	59.5
DeepSeek-R1-Distill-70B	69.3	51.5	63.4	66.2	57.5
DeepSeek-R1	78.7	70.4	85.0	73.0	62.8
o1-mini	63.6	54.8	—	60.0	53.8
o1	74.6	75.3	67.5	76.7	71.0
o3-mini	88.0	78.0	74.6	77.7	69.5
Claude-3.7-Sonnet	55.3	58.7	54.6	76.8	—
Gemini-2.5-Pro	92.0	86.7	61.1	84.0	69.2
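To see where the reinforcement learning stage helped most, the two Phi rows of the table above can be compared benchmark by benchmark. The sketch below copies those two rows and computes the point differences. The names BENCHMARKS, SCORES, and deltas are illustrative.

```python
# Column order matches the table above.
BENCHMARKS = ["AIME 24", "AIME 25", "OmniMath", "GPQA-D", "LiveCodeBench"]

# Scores copied from the Phi-4-reasoning and Phi-4-reasoning-plus rows.
SCORES = {
    "Phi-4-reasoning": [75.3, 62.9, 76.6, 65.8, 53.8],
    "Phi-4-reasoning-plus": [81.3, 78.0, 81.9, 68.9, 53.1],
}


def deltas(base: str, new: str) -> dict[str, float]:
    """Point difference (new minus base) per benchmark, rounded to 0.1."""
    return {
        name: round(n - b, 1)
        for name, b, n in zip(BENCHMARKS, SCORES[base], SCORES[new])
    }


if __name__ == "__main__":
    for name, diff in deltas("Phi-4-reasoning", "Phi-4-reasoning-plus").items():
        print(f"{name}: {diff:+.1f}")
```

The largest gain is on AIME 25 (+15.1 points), while LiveCodeBench is slightly lower (-0.7 points).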

Benchmark	Phi-4	Phi-4-reasoning	Phi-4-reasoning-plus	o3-mini	GPT-4o
FlenQA [3K-token subset]	82.0	97.7	97.9	96.8	90.8
IFEval Strict	62.3	83.4	84.9	91.5	81.8
ArenaHard	68.1	73.3	79.0	81.9	75.6
HumanEvalPlus	83.5	92.9	92.3	94.0	88.0
MMLUPro	71.5	74.3	76.0	79.4	73.0
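Kitab, No Context - Precision	19.3	23.2	27.6	37.9	53.7
Kitab, With Context - Precision	88.5	91.5	93.6	94.0	84.7
Kitab, No Context - Recall	8.2	4.9	6.3	4.2	20.3
Kitab, With Context - Recall	68.1	74.8	75.4	76.1	69.2
Toxigen Discriminative, Toxic category	72.6	86.7	77.3	85.4	87.6
Toxigen Discriminative, Neutral category	90.0	84.7	90.5	88.7	85.1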
PhiBench 2.21	58.2	70.6	74.2	78.0	72.4