Phi-4-reasoning-unsloth-bnb-4bit Open-Source Reasoning Model - Improve Math and Coding Reasoning for Free

Phi 4 Reasoning Unsloth Bnb 4bit

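Developed by unsloth

Phi-4-reasoning is an advanced reasoning model developed by Microsoft. It is fine-tuned from Phi-4 and focuses on improving reasoning in mathematics, science, and coding.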

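Large Language Model

Transformers

Supports multiple languages
License: MIT
#MathReasoningOptimization #LongContextProcessing #ScientificProblemSolving

Downloads: 1,969

Released: 5/1/2025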

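Model Overview

Phi-4-reasoning is an open-weight reasoning model trained with supervised fine-tuning and reinforcement learning, suited to tasks that require complex reasoning.

Model Features

Advanced reasoning

Supervised fine-tuning and reinforcement learning target reasoning in mathematics, science, and coding.

Efficient architecture

Built on the Phi-4 base model: a 14B-parameter dense decoder-only Transformer.

Long-context processing

Supports a 32k-token context length for handling complex inputs.

Extensive evaluation

Evaluated on multiple open-source and internal benchmarks, with strong results.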

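Model Capabilities

Mathematical reasoning

Scientific question answering

Code generation

Algorithmic problem solving

Complex input processing

Use Cases

Education

Math olympiad problem solving

Solves difficult math olympiad problems.

Scores 62.9 on the AIME 2025 benchmark.

Scientific question answering

Answers complex science questions.

Scores 65.8 on the GPQA-Diamond benchmark.

Programming

Code generation

Generates functional code.

Scores 92.9 on the HumanEvalPlus benchmark.

Algorithmic problem solving

Solves algorithmic problems such as 3SAT and TSP.

Scores 53.8 on the LiveCodeBench benchmark.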

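🚀 Phi-4-reasoning Model Card

Phi-4-reasoning is an advanced reasoning model fine-tuned from Phi-4. It is trained with supervised fine-tuning on a targeted dataset and with reinforcement learning, and it focuses on reasoning in mathematics, science, and coding, which makes it a strong building block for generative AI features.

🚀 Quick Start

The model is intended to accelerate language model research and to serve as a building block for generative AI features. The sections below describe how to run inference.

Inference Parameters

For inference, set temperature=0.8, top_p=0.95, and do_sample=True. For more complex queries, raise the maximum number of tokens to 32k to allow a longer chain of thought (CoT).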
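
To make the effect of these two sampling parameters concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is a simplified illustration, not the code path used inside `transformers`; the function name `sample_top_p` and the toy logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one index from `logits` with temperature scaling and top-p filtering."""
    rng = rng or random.Random()
    # Temperature below 1.0 sharpens the distribution before sampling.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy example: four candidate tokens with made-up logits.
print(sample_top_p([2.0, 1.0, 0.5, -1.0]))
```

With do_sample=False the model would instead pick the highest-scoring token at every step, so neither parameter would have any effect.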

Input Format

Given the nature of the training data, always use the ChatML template with the following system prompt for inference:

<|im_start|>system<|im_sep|>
You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:<|im_end|>
<|im_start|>user<|im_sep|>
What is the derivative of x^2?<|im_end|>
<|im_start|>assistant<|im_sep|>
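
The layout above can be assembled by hand when a framework does not apply the chat template for you. The sketch below mirrors the line breaks exactly as printed here; `build_prompt` is a hypothetical helper, and the tokenizer's own chat template (via `apply_chat_template`) remains the authoritative source for exact whitespace.

```python
def build_prompt(system_prompt, user_message):
    """Assemble a single-turn prompt in the ChatML-style layout shown above."""
    return (
        f"<|im_start|>system<|im_sep|>\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user<|im_sep|>\n{user_message}<|im_end|>\n"
        # The trailing assistant marker asks the model to start its reply.
        "<|im_start|>assistant<|im_sep|>\n"
    )


# Shortened placeholder system prompt; use the full prompt from this section in practice.
print(build_prompt("You are Phi, ...", "What is the derivative of x^2?"))
```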

Using the `transformers` library

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" places the weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-reasoning", device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"},
    {"role": "user", "content": "What is the derivative of x^2?"},
]
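# Render the messages with the tokenizer's chat template and append the assistant turn marker.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample with the recommended parameters; raise max_new_tokens for longer chains of thought.
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
)
# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(outputs[0]))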

Using the `vllm` library

vllm serve microsoft/Phi-4-reasoning --enable-reasoning --reasoning-parser deepseek_r1

Phi-4-reasoning can also be used directly with Ollama, llama.cpp, and any framework compatible with Phi-4.
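
Once the `vllm serve` command above is running, the server can be queried over its OpenAI-compatible HTTP API. The following is a minimal sketch using only the standard library. It assumes the default address `http://localhost:8000`, and it assumes that the reasoning parser returns the thought section in a `reasoning_content` field, which may differ between vLLM versions; `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(question, system_prompt, base_url="http://localhost:8000"):
    """Build a chat completion request with the recommended sampling parameters."""
    payload = {
        "model": "microsoft/Phi-4-reasoning",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": 4096,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, system_prompt):
    """Send the request and return (reasoning, answer); needs a running server."""
    with urllib.request.urlopen(build_chat_request(question, system_prompt)) as resp:
        message = json.load(resp)["choices"][0]["message"]
    # Assumption: the field name depends on the vLLM version, so use .get().
    return message.get("reasoning_content"), message.get("content")
```

Calling `ask("What is the derivative of x^2?", system_prompt)` with the system prompt from the Input Format section would return the two parts of the reply separately, provided the field name assumption holds for the installed version.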

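✨ Key Features

Advanced reasoning: Phi-4-reasoning is an advanced open-weight reasoning model, trained with supervised fine-tuning on a dataset of chain-of-thought traces and with reinforcement learning, focused on reasoning in mathematics, science, and coding.
Efficient architecture: Built on the previously released Phi-4 base model, with 14B parameters in a dense decoder-only Transformer architecture.
Long-context processing: Supports a 32k-token context length for handling complex inputs.
Extensive evaluation: Evaluated on multiple open-source and internal benchmarks, including math olympiad problems, code generation, and algorithmic problem solving, with strong results.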
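
Because the prompt and the generated chain of thought share the same 32k-token window, it can help to cap the generation length by what is left after the prompt. The sketch below is illustrative only: it assumes that "32k" means 32,768 tokens, and `max_new_tokens_for` is a hypothetical helper.

```python
CONTEXT_WINDOW = 32 * 1024  # assumption: "32k" read as 32,768 tokens


def max_new_tokens_for(prompt_tokens, requested=4096, context_window=CONTEXT_WINDOW):
    """Return how many tokens can still be generated without exceeding the window."""
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    # Never ask for more than the caller requested or than the window allows.
    return min(requested, remaining)


print(max_new_tokens_for(prompt_tokens=1000))
print(max_new_tokens_for(prompt_tokens=30000))
```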

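📚 Detailed Documentation

Model Summary

Property	Details
Developers	Microsoft Research
Description	Phi-4-reasoning is an advanced open-weight reasoning model, fine-tuned from Phi-4 with supervised fine-tuning and reinforcement learning. The supervised fine-tuning dataset combines synthetic prompts with high-quality filtered data from public domain websites, focused on math, science, and coding skills, along with alignment data for safety and responsible AI.
Architecture	The base model is the same as the previously released Phi-4: a 14B-parameter dense decoder-only Transformer.
Inputs	Text, best suited to prompts in chat format.
Context length	32k tokens
Training GPUs	32 H100-80G GPUs
Training time	2.5 days
Training data	16B tokens, of which about 8.3B are unique
Outputs	Generated text in response to the input. Model responses have two sections: a chain-of-thought reasoning block followed by a summarization block.
Training dates	January 2025 - April 2025
Status	Static model trained on an offline dataset, with a cutoff of March 2025 and earlier for publicly available data.
Release date	April 30, 2025
License	MIT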
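
Since each response consists of a reasoning block followed by a summary, downstream code often wants the two parts separately. The following minimal sketch assumes the reasoning is wrapped in `<think>` and `</think>` tags, as requested by the system prompt; `split_response` is a hypothetical helper.

```python
import re

# Non-greedy match so that only the first reasoning block is captured.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text):
    """Split generated text into (thought, solution)."""
    match = _THINK.search(text)
    if match is None:
        # No reasoning block found: treat everything as the solution.
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


thought, solution = split_response("<think>Apply the power rule.</think>The derivative is 2x.")
print(thought)
print(solution)
```

Apply this to the newly generated text only. The system prompt itself contains the literal tags, so running the function on a decoded sequence that still includes the prompt would match the prompt instead of the model's reasoning.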

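Intended Use

Use type	Details
Primary use cases	The model is designed to accelerate research on language models and to serve as a building block for generative AI features. It is intended for general-purpose AI systems and applications (primarily in English) that must run in memory- or compute-constrained environments, meet low-latency requirements, or depend on reasoning and logic.
Out-of-scope use cases	The model is designed and tested for math reasoning only and is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate accuracy, safety, and fairness issues before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should comply with applicable laws and regulations and refer to the "Responsible AI Considerations" section for further guidance.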

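Data Overview

Training Datasets

The training data consists of question-answer and chat-format data in mathematics, science, and coding. Chat prompts are drawn from filtered high-quality web data and may be rewritten and processed through a synthetic data generation pipeline. The data also includes examples aimed at improving truthfulness and safety.

Benchmark Datasets

Phi-4-reasoning was evaluated with the open-source Eureka evaluation suite and internal benchmarks, specifically:

Reasoning tasks: AIME 2025, 2024, 2023, and 2022 (math olympiad questions); GPQA-Diamond (complex science questions); OmniMath (olympiad-level math problems); LiveCodeBench (code generation); 3SAT and TSP (algorithmic problem solving); BA Calendar (planning); Maze and SpatialMap (spatial understanding).
General-purpose benchmarks: Kitab (information retrieval); IFEval and ArenaHard (instruction following); PhiBench (internal benchmark); FlenQA (impact of prompt length on model performance); HumanEvalPlus (functional code generation); MMLU-Pro (aggregated dataset for multi-task language understanding).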

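Safety

Approach

Phi-4-reasoning uses a robust safety post-training approach based on supervised fine-tuning (SFT). It draws on a variety of open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines.

Safety Evaluation and Red-Teaming

Before release, Phi-4-reasoning went through a multi-faceted evaluation. Quantitative evaluation used multiple open-source safety benchmarks and in-house tools based on adversarial conversation simulation. Qualitative safety evaluation was carried out with Microsoft's independent AI Red Team (AIRT), which assessed safety risks in both average and adversarial user scenarios. The model was also evaluated for bias and toxicity on the Toxigen benchmark.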

Model Quality

An overview of model quality on representative benchmarks:

Model	AIME 24	AIME 25	OmniMath	GPQA-D	LiveCodeBench (8/1/24–2/1/25)
Phi-4-reasoning	75.3	62.9	76.6	65.8	53.8
Phi-4-reasoning-plus	81.3	78.0	81.9	68.9	53.1
OpenThinker2-32B	58.0	58.0	—	64.1	—
QwQ 32B	79.5	65.8	—	59.5	63.4
EXAONE-Deep-32B	72.1	65.8	—	66.1	59.5
DeepSeek-R1-Distill-70B	69.3	51.5	63.4	66.2	57.5
DeepSeek-R1	78.7	70.4	85.0	73.0	62.8
o1-mini	63.6	54.8	—	60.0	53.8
o1	74.6	75.3	67.5	76.7	71.0
o3-mini	88.0	78.0	74.6	77.7	69.5
Claude-3.7-Sonnet	55.3	58.7	54.6	76.8	—
Gemini-2.5-Pro	92.0	86.7	61.1	84.0	69.2

Benchmark	Phi-4	Phi-4-reasoning	Phi-4-reasoning-plus	o3-mini	GPT-4o
FlenQA [3K-token subset]	n/a	n/a	n/a	n/a	n/a
IFEval Strict	62.3	83.4	84.9	91.5	81.8
ArenaHard	68.1	73.3	79.0	81.9	75.6
HumanEvalPlus	83.5	92.9	92.3	94.0	88.0
MMLUPro	71.5	74.3	76.0	79.4	73.0
Kitab, No Context, Precision	19.3	23.2	27.6	37.9	53.7
Kitab, With Context, Precision	88.5	91.5	93.6	94.0	84.7
Kitab, No Context, Recall	8.2	4.9	6.3	4.2	20.3
Kitab, With Context, Recall	68.1	74.8	75.4	76.1	69.2
Toxigen Discriminative, Toxic category	72.6	86.7	77.3	85.4	87.6
Toxigen Discriminative, Neutral category	90.0	84.7	90.5	88.7	85.1
PhiBench 2.21	58.2	70.6	74.2	78.0	72.4

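Overall, with only 14B parameters, Phi-4-reasoning performs strongly across a wide range of reasoning tasks, significantly outperforming much larger open-weight models such as the DeepSeek-R1 distilled 70B model and approaching the performance of the full DeepSeek R1 model.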
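
The comparison with the 70B distilled model can be checked directly against the first table. The sketch below copies the Phi-4-reasoning and DeepSeek-R1-Distill-70B rows from that table and computes the per-benchmark difference; the variable names are illustrative.

```python
# (Phi-4-reasoning, DeepSeek-R1-Distill-70B) scores copied from the table above.
scores = {
    "AIME 24": (75.3, 69.3),
    "AIME 25": (62.9, 51.5),
    "OmniMath": (76.6, 63.4),
    "GPQA-D": (65.8, 66.2),
    "LiveCodeBench": (53.8, 57.5),
}

# Positive values mean Phi-4-reasoning scores higher.
deltas = {name: round(phi - distill, 1) for name, (phi, distill) in scores.items()}
wins = sum(1 for d in deltas.values() if d > 0)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print(f"Phi-4-reasoning ahead on {wins} of {len(deltas)} benchmarks")
```

On these five columns the 14B model leads on the three math benchmarks and trails slightly on GPQA-D and LiveCodeBench, so the advantage is largest on math.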

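Responsible AI Considerations

Like other language models, Phi-4-reasoning can behave in ways that are unfair, unreliable, or offensive. Developers should apply responsible AI best practices and ensure that a specific use case complies with relevant laws and regulations. Using safety services such as Azure AI Content Safety is recommended.

Limitations

Quality of service: The model is trained primarily on English text, so performance is worse in languages other than English. English language varieties with less representation in the training data may perform worse than standard American English. Phi-4-reasoning does not support multilingual use.
Representation of harms and perpetuation of stereotypes: The model can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes.
Inappropriate or offensive content: The model may produce other types of inappropriate or offensive content, so deploying it in sensitive contexts requires additional mitigations.
Information reliability: Language models can generate nonsensical content or fabricate content that sounds reasonable but is inaccurate or outdated.
Election information reliability: The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election-critical information being presented.
Limited scope for code: Most of Phi-4-reasoning's training data is based on Python and uses common packages. If the model generates Python scripts that use other packages, or scripts in other languages, users should manually verify all API uses.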