Llama-3.1-Nemotron-Nano-8B-v1: an open-source chat model with long-context reasoning that balances efficiency and performance

Llama 3.1 Nemotron Nano 8B V1

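Developed by NVIDIA

A reasoning and chat model optimized from Meta Llama-3.1-8B-Instruct. It supports a 128K-token context length and balances efficiency with performance.

Large language model

Transformers

Language: English | License: Other | Tags: #efficient-inference #128K-long-context #single-RTX-GPU-deployment

Downloads: 60.52k

Released: 3/16/2025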

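Model introduction

A large language model focused on reasoning, human chat preferences, and task execution (such as RAG and tool calling). It can be deployed locally on a single RTX GPU.

Model features

Dual-mode reasoning

Reasoning can be switched ON or OFF: ON mode produces a step-by-step thinking process, while OFF mode outputs the result directly.

Long-context support

Supports a context window of up to 128K tokens, suitable for complex documents and long conversations.

Efficient deployment

After optimization, the model runs on a single consumer-grade RTX-series GPU, lowering the barrier to deployment.

Reinforcement learning optimization

Multiple rounds of reinforcement learning (RLOO/RPO) improve alignment with human preferences and task execution.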

Model capabilities

Mathematical reasoning

Code generation

Tool calling

Multi-turn dialogue

Multilingual support

RAG system integration

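Use cases

Intelligent assistants

Math problem solving

Solves complex equations and proof problems

95.4% pass@1 on MATH500 (reasoning on)

Programming assistance

Generates and debugs Python code

84.6% pass@1 on MBPP 0-shot (reasoning on)

Enterprise applications

Document analysis

Analyzes long documents and contract text

Supports a 128K context length

Knowledge Q&A systems

Builds RAG-based question answering systems for specialized domains

63.9% on BFCL v2 Live (reasoning off)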

🚀 Llama-3.1-Nemotron-Nano-8B-v1

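Llama-3.1-Nemotron-Nano-8B-v1 is a large language model derived from Meta Llama-3.1-8B-Instruct. It performs well at reasoning, human chat preferences, and tasks such as RAG and tool calling, and offers a good trade-off between model accuracy and efficiency.

🚀 Quick start

You can try this model through the preview API: Llama-3.1-Nemotron-Nano-8B-v1.

The examples below use the Hugging Face Transformers library. Reasoning mode (on/off) is controlled through the system prompt. The code requires transformers version 4.44.2 or later.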

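💻 Usage examples

Basic usage: reasoning on

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
# Load the weights in bfloat16 and let Accelerate place them on the available device(s).
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Reuse the EOS token as the padding token.
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    do_sample=True,  # make sampling explicit so temperature and top_p take effect
    temperature=0.6,
    top_p=0.95,
    **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "on"

# The system prompt selects the reasoning mode; the task itself goes in the user message.
messages = [
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
]
print(pipeline(messages))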

Advanced usage: reasoning off

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
# Load the weights in bfloat16 and let Accelerate place them on the available device(s).
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Reuse the EOS token as the padding token.
tokenizer.pad_token_id = tokenizer.eos_token_id

# Thinking can be "on" or "off"
thinking = "off"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    do_sample=False,  # greedy decoding, as recommended for reasoning-off mode
    **model_kwargs
)

# The trailing assistant message pre-fills an empty <think> block.
messages = [
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
    {"role": "assistant", "content": "<think>\n</think>"},
]
print(pipeline(messages))
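
The two examples differ only in the system prompt, the decoding settings, and the pre-filled empty think block. A minimal sketch of a helper that keeps these three in sync is shown here; the name `build_request` and its return shape are hypothetical and not part of the model card.

```python
from typing import Any, Dict, List, Tuple


def build_request(user_prompt: str, thinking: bool) -> Tuple[List[Dict[str, str]], Dict[str, Any]]:
    """Return (messages, generation kwargs) for the chosen reasoning mode."""
    mode = "on" if thinking else "off"
    messages = [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": user_prompt},
    ]
    if thinking:
        # Recommended sampling settings for reasoning-on mode.
        gen_kwargs = {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    else:
        # Reasoning-off mode: pre-fill an empty think block and decode greedily.
        messages.append({"role": "assistant", "content": "<think>\n</think>"})
        gen_kwargs = {"do_sample": False}
    return messages, gen_kwargs
```

The returned kwargs are ordinary generation arguments, so they can be forwarded when calling the pipeline instead of being fixed at construction time.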

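🔧 Usage recommendations

- Reasoning mode (on/off) is controlled through the system prompt, which must be set exactly as in the usage examples. All instructions should go in the user prompt.
- For reasoning-on mode, a temperature of 0.6 and a top-p of 0.95 are recommended.
- For reasoning-off mode, greedy decoding is recommended.
- For every benchmark that requires a specific template, the prompts used for evaluation are provided.
- In reasoning-on mode, the model emits an empty <think></think> when no reasoning was necessary. This is expected behavior.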
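
Because the reasoning is wrapped in <think></think>, downstream code usually needs to separate it from the final answer. The following is a minimal sketch, assuming the generated text contains at most one think block placed before the answer; `split_reasoning` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Non-greedy match so only the first think block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer)."""
    match = THINK_RE.search(text)
    if match is None:
        # No think block at all: treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

An empty think block yields an empty reasoning string, which matches the expected behavior described in the last recommendation.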

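✨ Key features

- Efficient reasoning: a multi-phase post-training process strengthens both reasoning and non-reasoning capabilities, with strong results on reasoning tasks.
- Multilingual support: supports English and programming languages, as well as non-English languages including German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Cost-effective: the model fits on a single RTX GPU and can be used locally, offering a good trade-off between model accuracy and compute efficiency.
- Long-context support: supports a context length of up to 128K tokens.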

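📚 Detailed documentation

Model overview

Llama-3.1-Nemotron-Nano-8B-v1 is a large language model (LLM) derived from Meta Llama-3.1-8B-Instruct (the reference model). It is a reasoning model that has been post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.

The model offers a good trade-off between model accuracy and efficiency. It is created from Llama 3.1 8B Instruct and improves on its accuracy. It fits on a single RTX GPU, can be used locally, and supports a context length of 128K.

The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning (SFT) stage for math, code, reasoning, and tool calling, as well as multiple reinforcement learning (RL) stages that use the REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction following. The final model checkpoint is obtained by merging the final SFT and Online RPO checkpoints, and was improved using Qwen.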
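
The card names REINFORCE (RLOO) but does not include training code. As a rough illustration only, the leave-one-out baseline that gives RLOO its name can be sketched as follows: for k sampled responses to the same prompt, each response's advantage is its reward minus the mean reward of the other k-1 responses. The function name and the toy rewards are hypothetical, and this is not NVIDIA's training pipeline.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k sampled responses to one prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


# Toy rewards for four sampled responses (made-up values).
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their peers get a positive advantage and the rest a negative one, without training a separate value model.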

This model is part of the Llama Nemotron collection. You can find the other model in this family here: Llama-3.3-Nemotron-Super-49B-v1.

This model is ready for commercial use.

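License / terms of use

Governing terms: Your use of this model is governed by the NVIDIA Open Model License.
Additional information: Llama 3.1 Community License Agreement. Built with Llama.

Model developer: NVIDIA

Model dates: Trained between August 2024 and March 2025

Data freshness: The pretraining data has a cutoff of 2023, per Meta Llama 3.1 8B

Use case

Intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications. Also suitable for typical instruction-following tasks. The model balances accuracy and compute efficiency (it fits on a single RTX GPU and can be used locally).

Release date

March 18, 2025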

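Model architecture

Property	Details
Architecture type	Dense decoder-only Transformer model
Network architecture	Llama 3.1 8B Instruct

Intended use

Llama-3.1-Nemotron-Nano-8B-v1 is a general-purpose reasoning and chat model intended for English and programming languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.

Input

Property	Details
Input type	Text
Input format	String
Input parameters	One-dimensional (1D)
Other input properties	Context length up to 131,072 tokens

Output

Property	Details
Output type	Text
Output format	String
Output parameters	One-dimensional (1D)
Other output properties	Context length up to 131,072 tokens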
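
When budgeting a request, it can help to check the generation length against the 131,072-token limit. The sketch below assumes that the prompt and the generated tokens share one context window, which the tables do not state explicitly; `max_new_tokens_allowed` is a hypothetical helper name.

```python
MAX_CONTEXT_TOKENS = 131_072  # context length listed in the input/output tables


def max_new_tokens_allowed(prompt_tokens: int, requested: int) -> int:
    """Clamp the requested generation length to what still fits in the window."""
    if prompt_tokens >= MAX_CONTEXT_TOKENS:
        raise ValueError("prompt alone exceeds the context window")
    # Assumption: prompt and generated tokens count against the same limit.
    return min(requested, MAX_CONTEXT_TOKENS - prompt_tokens)
```

For example, a 120,000-token prompt leaves room for 11,072 new tokens, so a request for 32,768 would be clamped.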

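Model version

1.0 (March 18, 2025)

Software integration

Runtime engine: NeMo 24.12
Recommended hardware microarchitecture compatibility:
- NVIDIA Hopper
- NVIDIA Ampere

Inference

Engine: Transformers
Test hardware:
- BF16:
  - 1x RTX 50 Series GPU
  - 1x RTX 40 Series GPU
  - 1x RTX 30 Series GPU
  - 1x H100-80GB GPU
  - 1x A100-80GB GPU
Preferred/supported operating system: Linux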
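
As a back-of-the-envelope check on the BF16 hardware list, the memory needed for the weights alone can be estimated from the parameter count. The sketch assumes roughly 8e9 parameters (inferred from the "8B" in the model name, not stated in the card) and 2 bytes per BF16 value. It ignores the KV cache, activations, and framework overhead, so real usage is higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumed parameter count of about 8e9; BF16 stores 2 bytes per value.
print(f"{weight_memory_gib(8e9):.1f} GiB")  # prints "14.9 GiB"
```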

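Training datasets

A large variety of training data was used for the post-training pipeline, including manually annotated data and synthetic data.

The data for the multi-stage post-training phases that improve code, math, and reasoning is a compilation of SFT and RL data. It supports improvements in the math, code, general reasoning, and instruction-following capabilities of the original Llama instruct model.

Prompts were either sourced from public and open corpora or synthetically generated. Responses were synthetically generated by a variety of models. Some prompts include responses for both reasoning-on and reasoning-off modes, to train the model to distinguish between the two.

Data collection for training datasets: Hybrid: automated, human, synthetic
Data labeling for training datasets: N/A

Evaluation datasets

We used the datasets listed under the evaluation results to evaluate Llama-3.1-Nemotron-Nano-8B-v1.

Data collection for evaluation datasets: Hybrid: human/synthetic
Data labeling for evaluation datasets: Hybrid: human/synthetic/automatic

Evaluation results

These results cover both reasoning-on and reasoning-off modes. We recommend temperature=0.6 and top_p=0.95 for reasoning-on mode and greedy decoding for reasoning-off mode. All evaluations use a sequence length of 32k tokens. Benchmarks are run up to 16 times and the scores are averaged for accuracy.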
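
The averaging step described above can be sketched as follows, assuming each run produces one pass/fail flag per problem. The function names and the toy results are made up for illustration and are not the evaluation harness used for the reported numbers.

```python
from statistics import mean
from typing import List


def pass_at_1(results: List[bool]) -> float:
    """Fraction of problems solved in a single run."""
    return sum(results) / len(results)


def averaged_score(runs: List[List[bool]]) -> float:
    """Mean pass@1 over several independent runs of the same benchmark."""
    return mean(pass_at_1(run) for run in runs)


# Two toy runs over four problems (made-up outcomes).
runs = [
    [True, False, True, True],
    [True, True, True, True],
]
score = averaged_score(runs)
```

Averaging over repeated runs matters most in reasoning-on mode, where sampling makes individual runs vary.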

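⚠️ Important note

Where applicable, a prompt template is provided. When running a benchmark, make sure you parse the output format requested by the provided prompt in order to reproduce the benchmark results below.

MT-Bench

Reasoning mode	Score
Reasoning off	7.9
Reasoning on	8.1

MATH500

Reasoning mode	pass@1
Reasoning off	36.6%
Reasoning on	95.4%

User prompt template: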

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

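Reproducing the math numbers requires filling this template and then parsing the `\boxed{}` answer from the response. The sketch below is one way to do both, assuming the `\n` in the template is a real newline and that the last `\boxed{...}` in the response holds the final answer. The helper names are hypothetical, and no answer normalization or grading is included.

```python
from typing import Optional

MATH_TEMPLATE = (
    "Below is a math question. I want you to reason through the steps and then "
    "give a final answer. Your final answer should be in \\boxed{}.\n"
    "Question: {question}"
)


def build_math_prompt(question: str) -> str:
    # str.format would fail on the literal "{}" in \boxed{}, so substitute directly.
    return MATH_TEMPLATE.replace("{question}", question)


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces
```

Brace counting is used instead of a simple regular expression so that answers such as `\frac{1}{2}` are returned whole.
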
AIME25

Reasoning mode	pass@1
Reasoning off	0%
Reasoning on	47.1%

User prompt template:

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

GPQA-D

Reasoning mode	pass@1
Reasoning off	39.4%
Reasoning on	54.1%

User prompt template:

"What is the correct answer to this question: {question}\nChoices:\nA. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\nLet's think step by step, and put the final answer (should be a single letter A, B, C, or D) into a \boxed{}"

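For the multiple-choice template, a minimal sketch of filling the four options and reading back the chosen letter might look like this. It assumes the options are supplied already in A-D order and that the final answer is the last `\boxed{}` holding a single letter; the helper names are hypothetical.

```python
import re
from typing import List, Optional

GPQA_TEMPLATE = (
    "What is the correct answer to this question: {question}\n"
    "Choices:\nA. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\n"
    "Let's think step by step, and put the final answer "
    "(should be a single letter A, B, C, or D) into a \\boxed{}"
)


def build_gpqa_prompt(question: str, options: List[str]) -> str:
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    # Plain replacement avoids str.format tripping on the literal "{}".
    prompt = GPQA_TEMPLATE.replace("{question}", question)
    for letter, option in zip("ABCD", options):
        prompt = prompt.replace("{option_" + letter + "}", option)
    return prompt


def extract_choice(response: str) -> Optional[str]:
    """Return the letter in the last \\boxed{...}, or None if there is none."""
    matches = re.findall(r"\\boxed\{\s*([ABCD])\s*\}", response)
    return matches[-1] if matches else None
```
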
IFEval Average

Reasoning mode	Strict: Prompt	Strict: Instruction
Reasoning off	74.7%	82.1%
Reasoning on	71.9%	79.3%

BFCL v2 Live

Reasoning mode	Score
Reasoning off	63.9%
Reasoning on	63.6%

User prompt template:

<AVAILABLE_TOOLS>{functions}</AVAILABLE_TOOLS>

{user_prompt}
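
The template does not say how `{functions}` should be serialized. The sketch below assumes a JSON list of function schemas, which is an assumption rather than something the card specifies; `build_tool_prompt` and the `get_weather` schema are hypothetical.

```python
import json
from typing import Any, Dict, List


def build_tool_prompt(functions: List[Dict[str, Any]], user_prompt: str) -> str:
    """Fill the tool-calling template with JSON-serialized function schemas."""
    tools_json = json.dumps(functions)  # assumption: tools are passed as a JSON list
    return f"<AVAILABLE_TOOLS>{tools_json}</AVAILABLE_TOOLS>\n\n{user_prompt}"


# Hypothetical tool schema, for illustration only.
tools = [{"name": "get_weather", "parameters": {"city": {"type": "string"}}}]
prompt = build_tool_prompt(tools, "What is the weather in Paris?")
```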

MBPP 0-shot

Reasoning mode	pass@1
Reasoning off	66.1%
Reasoning on	84.6%

User prompt template:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Here is the given problem and test examples:
{prompt}
Please use the python programming language to solve this problem.
Please make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.
Please return all completed codes in one code block.
This code block should be in the following format:
```python
# Your codes here
```

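Since the template asks for all code in a single python code block, scoring a response means pulling that block out before running the tests. A minimal sketch follows, assuming the response follows the requested format; it takes the last python block if there are several. `extract_python_block` is a hypothetical helper name.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example
CODE_BLOCK_RE = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_block(response: str) -> Optional[str]:
    """Return the body of the last fenced python block, or None if absent."""
    blocks = CODE_BLOCK_RE.findall(response)
    return blocks[-1].strip() if blocks else None
```
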
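### Ethical considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety.md), and [Privacy](privacy.md) subcards.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).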

### Citation

@misc{bercovich2025llamanemotronefficientreasoningmodels,
      title={Llama-Nemotron: Efficient Reasoning Models},
      author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk},
      year={2025},
      eprint={2505.00949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.00949},
}


## 📄 License
Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). For additional information, see the [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/).