🚀 STILL-3-1.5B-preview: A Slow-Thinking Reasoning Model
We release STILL-3-1.5B-preview, a slow-thinking reasoning model that achieves 39.33% accuracy on the AIME benchmark! We apply reinforcement learning to a 1.5B-parameter model and observe that its performance keeps improving as the number of training steps increases. To facilitate reproduction of our work and advance research in this area, we open-source the code, model, and data.
Code: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
🚀 Quick Start
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the tokenizer to format the prompt with the model's chat template
model_path = "RUC-AIBOX/STILL-3-1.5B-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)

question = "Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\\theta),$ where $r > 0$ and $0 \\le \\theta < 2 \\pi.$"

input_prompts = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# Load the model with vLLM for inference
llm = LLM(model=model_path, tensor_parallel_size=1, dtype="bfloat16")

sampling_params = SamplingParams(
    temperature=0.6, top_p=0.95, max_tokens=32768, seed=42, skip_special_tokens=False
)

responses = llm.generate(input_prompts, sampling_params)
print(responses[0].outputs[0].text)
✨ Key Features
We evaluate the model on four benchmarks: MATH, AIME, OMNI, and LiveAOPS. For MATH and AIME, we use a sampling decoding setup with a sampling temperature of 0.6 and a top-p sampling probability of 0.95. Each question is sampled 64 times, and the average score is reported (a rough sketch of this protocol is given after the table below). For OMNI and LiveAOPS (August–November 2024), we randomly sample a subset of questions whose answers are integers to enable automated evaluation, and evaluate them with greedy decoding. The trained model, STILL-3-1.5B-preview, achieves significant improvements: accuracy on AIME rises from 28.67% to 39.33%, a relative improvement of 37.18%.
| | MATH | AIME | OMNI | LiveAOPS | Avg. |
| --- | --- | --- | --- | --- | --- |
| Backbone | 84.04 | 28.67 | 25.60 | 33.33 | 42.91 |
| STILL-3-1.5B-preview | 85.48 | 39.33 | 33.00 | 39.50 | 49.33 |
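As a rough illustration of the sampled-decoding protocol described above (this is a minimal sketch, not the team's released evaluation code), the snippet below averages accuracy over 64 sampled completions per question with vLLM. The names `prompts`, `gold_answers`, and `is_correct` are hypothetical placeholders you would need to supply.

from vllm import LLM, SamplingParams

# Hypothetical sketch: 64 samples per question at temperature 0.6, top-p 0.95,
# then average per-question accuracy; `prompts`, `gold_answers`, and `is_correct`
# are placeholders and not part of the release.
llm = LLM(model="RUC-AIBOX/STILL-3-1.5B-preview", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, n=64, max_tokens=32768)

outputs = llm.generate(prompts, sampling_params)  # one chat-formatted prompt per question

per_question_scores = []
for request_output, gold in zip(outputs, gold_answers):
    # Fraction of the 64 samples that answer this question correctly
    hits = sum(is_correct(sample.text, gold) for sample in request_output.outputs)
    per_question_scores.append(hits / len(request_output.outputs))

print(f"Average score: {100 * sum(per_question_scores) / len(per_question_scores):.2f}")

For the greedy-decoding benchmarks, the same loop applies with SamplingParams(temperature=0.0, n=1).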
📚 Detailed Documentation
If our report is helpful for your research, please kindly cite:
@article{Slow_Thinking_with_LLMs_3_Preview,
title={STILL-3-1.5B-preview: Enhancing Slow Thinking Abilities of Small Models through Reinforcement Learning},
author={RUCAIBox STILL Team},
url={https://github.com/RUCAIBox/Slow_Thinking_with_LLMs},
year={2025}
}
@article{Slow_Thinking_with_LLMs_1,
title={Enhancing LLM Reasoning with Reward-guided Tree Search},
author={Jiang, Jinhao and Chen, Zhipeng and Min, Yingqian and Chen, Jie and Cheng, Xiaoxue and Wang, Jiapeng and Tang, Yiru and Sun, Haoxiang and Deng, Jia and Zhao, Wayne Xin and Liu, Zheng and Yan, Dong and Xie, Jian and Wang, Zhongyuan and Wen, Ji-Rong},
journal={arXiv preprint arXiv:2411.11694},
year={2024}
}
@article{Slow_Thinking_with_LLMs_2,
title={Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems},
author={Min, Yingqian and Chen, Zhipeng and Jiang, Jinhao and Chen, Jie and Deng, Jia and Hu, Yiwen and Tang, Yiru and Wang, Jiapeng and Cheng, Xiaoxue and Song, Huatong and Zhao, Wayne Xin and Liu, Zheng and Wang, Zhongyuan and Wen, Ji-Rong},
journal={arXiv preprint arXiv:2412.09413},
year={2024}
}