Granite 3.2-8B-Instruct: a free, open-source instruction-tuned language model for efficient inference

Granite 3.2 8b Instruct GGUF

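Developed by Mungert

An 8B-parameter instruction-tuned language model from the IBM Granite family, quantized with the IQ-DynamicGate ultra-low-bit method for efficient inference.

Large language model | License: Apache-2.0 | #ultra-low-bit-quantization #precision-adaptive #edge-device-inference

Downloads: 1,048

Published: 3/19/2025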

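Model overview

This is a mid-sized language model from the IBM Granite family, instruction-tuned for text generation. The GGUF builds use the IQ-DynamicGate quantization method, which is designed to keep perplexity comparatively low at 1-2 bit precision.

Model features

IQ-DynamicGate quantization

A precision-adaptive quantization method for 1-2 bit models that uses a layer-specific strategy to preserve accuracy while staying memory-efficient.

Mixed-precision allocation

The first 25% and last 25% of layers use IQ4_XS, the middle 50% use IQ2_XXS/IQ3_S, and critical components are protected with Q5_K.

Efficient inference

Optimized for CPUs and low-VRAM devices, with several quantized variants for different hardware.

Model capabilities

Text generation

Instruction following

Low-resource inference

Use cases

Edge computing

AI assistants on mobile devices

Deploy an assistant on memory-constrained mobile devices.

The IQ1_M variant reduces perplexity by 43.9% relative to standard IQ1_M quantization (measured on Llama-3-8B).

Research and development

Ultra-low-bit quantization research

Serves as a research platform for 1-2 bit quantization techniques.

The IQ2_S variant reduces perplexity by 36.9% for only 0.2GB of extra size (measured on Llama-3-8B).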

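🚀 Granite-3.2-8B-Instruct GGUF Models

Granite-3.2-8B-Instruct is an 8-billion-parameter, long-context AI model fine-tuned for reasoning ("thinking"). It builds on Granite-3.1-8B-Instruct and was trained on a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The thinking capability can be switched on or off, so it is applied only when required.

Model information

Attribute	Details
Developers	Granite Team, IBM
Website	Granite Docs
Release date	February 26, 2025
License	Apache 2.0
Supported languages	English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may also fine-tune the model for languages beyond these 12.
Intended use	The model is designed to handle general instruction-following tasks and can be integrated into AI assistants across domains, including business applications.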

🚀 Quick start

Install dependencies

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers

Code example

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

model_path = "ibm-granite/granite-3.2-8b-instruct"
device = "cuda"

# Load the model in bfloat16 and place it on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

conv = [{"role": "user", "content":"You have 10 liters of a 30% acid solution. How many liters of a 70% acid solution must be added to achieve a 50% acid mixture?"}]

# thinking=True enables the model's reasoning mode; set it to False for a direct answer.
input_ids = tokenizer.apply_chat_template(conv, return_tensors="pt", thinking=True, return_dict=True, add_generation_prompt=True).to(device)

# Fix the random seed for reproducibility.
set_seed(42)
output = model.generate(
    **input_ids,
    max_new_tokens=8192,
)

# Decode only the newly generated tokens, skipping the prompt.
prediction = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)

Example output

thinking=True

Here is my thought process:

1. We know we start with 10 liters of a 30% acid solution, so there's 3 liters of pure acid in it (10 liters * 0.3).
2. We need to determine how much of the 70% acid solution to add such that the final mixture is 50% acid.
3. Let's say x liters of the 70% acid solution are added. This will contribute 0.7x liters of pure acid.
4. The total volume of the new mixture will be 10 + x liters, and the total amount of pure acid will be 3 + 0.7x liters.
5. We need to set up an equation where (3 + 0.7x) / (10 + x) = 0.5 (since we want a 50% acid mixture).
6. Solving this equation for x will give us the volume of 70% solution needed.

Here is my response:

To solve this, we need to set up an equation based on the concentration of acid before and after mixing. 

1. **Initial Acid Content**: 
   - We start with 10 liters of a 30% acid solution. 
   - Pure acid in this = 10 * 0.3 = 3 liters.

2. **Variable for Addition**:
   - Let x be the liters of 70% acid solution to be added.
   - Pure acid in x liters of 70% solution = 0.7x liters.

3. **Final Mixture Concentration**:
   - Total volume of mixture = 10 + x liters.
   - Total pure acid in the mixture = 3 (from initial solution) + 0.7x (from added solution).

4. **Setting up the equation for 50% acid concentration**:
   - We require that the ratio of pure acid to total volume is 0.5 (50%).
   - Therefore, the equation is:

   \[
   \frac{3 + 0.7x}{10 + x} = 0.5
   \]

5. **Solving the equation**:

   Multiply both sides by (10 + x) to clear the fraction:

   \[
   3 + 0.7x = 0.5(10 + x)
   \]

   Expanding the right side:

   \[
   3 + 0.7x = 5 + 0.5x
   \]

   Subtract 0.5x from both sides:

   \[
   3 + 0.2x = 5
   \]

   Subtract 3 from both sides:

   \[
   0.2x = 2
   \]

   Divide by 0.2:

   \[
   x = 10
   \]

So, 10 liters of the 70% acid solution must be added to the initial 10 liters of 30% solution to get a 50% acid mixture. 

Let's verify: 

- Initial acid: 3 liters
- Acid from added solution: 0.7 * 10 = 7 liters
- Total acid: 3 + 7 = 10 liters
- Total volume: 10 + 10 = 20 liters
- Acid percentage: \(\frac{10}{20} = 0.5 = 50\%\)

This confirms our solution. 

Final answer: 10 liters of 70% acid solution should be added.

thinking=False

Let's denote the amount of 70% acid solution to be added as x liters.

We start with 10 liters of a 30% acid solution, which means there are 0.3 * 10 = 3 liters of pure acid in it.

We want to end up with a 50% acid solution after adding x liters of 70% acid solution. The total volume of the mixture will be 10 + x liters, and we want half of this volume to be pure acid.

So, the amount of pure acid in the final mixture should be 0.5 * (10 + x) liters.

The amount of pure acid in the final mixture comes from two sources: the original 30% solution and the added 70% solution. So, we can set up the following equation:

3 (from the original solution) + 0.7x (from the added solution) = 0.5 * (10 + x)

Now, let's solve for x:

3 + 0.7x = 5 + 0.5x
0.7x - 0.5x = 5 - 3
0.2x = 2
x = 2 / 0.2
x = 10

So, you need to add 10 liters of a 70% acid solution to the 10 liters of a 30% acid solution to get a 50% acid mixture.
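
With `thinking=True`, the sample output above separates the reasoning from the answer using the marker lines "Here is my thought process:" and "Here is my response:". Below is a minimal sketch of splitting a generated string on those markers. The helper name `split_thinking` is hypothetical, and the sketch assumes the markers appear exactly as in the sample, which the model does not guarantee.

```python
THOUGHT_MARKER = "Here is my thought process:"
RESPONSE_MARKER = "Here is my response:"


def split_thinking(prediction: str) -> tuple:
    """Return (thought, response); thought is empty if the markers are missing."""
    thought_start = prediction.find(THOUGHT_MARKER)
    response_start = prediction.find(RESPONSE_MARKER)
    if thought_start == -1 or response_start == -1 or response_start < thought_start:
        # No recognizable reasoning section, e.g. output produced with thinking=False.
        return "", prediction.strip()
    thought = prediction[thought_start + len(THOUGHT_MARKER):response_start]
    response = prediction[response_start + len(RESPONSE_MARKER):]
    return thought.strip(), response.strip()


# Shortened stand-in for a real generation, for illustration only.
sample = (
    "Here is my thought process:\n\n1. Set up the equation.\n\n"
    "Here is my response:\n\nFinal answer: 10 liters."
)
thought, response = split_thinking(sample)
print(response)
```

Falling back to the whole string when the markers are absent lets the same helper handle `thinking=False` output.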

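✨ Key features

Ultra-low-bit quantization with IQ-DynamicGate (1-2 bit)

Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with improvements demonstrated by benchmarks on Llama-3-8B. It uses layer-specific strategies to preserve accuracy while remaining highly memory-efficient.

Benchmark setup

All tests were run on Llama-3-8B-Instruct with:

Standard perplexity evaluation pipeline
2048-token context window
The same prompt set for every quantization

Quantization method

Dynamic precision allocation:
- First/last 25% of layers → IQ4_XS (selected layers)
- Middle 50% → IQ2_XXS/IQ3_S (for efficiency)
Critical component protection:
- Embeddings/output layers use Q5_K
- Error propagation is reduced by 38% compared with standard 1-2 bit quantization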
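
The depth-based allocation can be illustrated with a short sketch. This is not the author's implementation. It assigns IQ4_XS to every layer in the outer bands (the source says only "selected layers"), always picks IQ2_XXS for the middle band, and uses a hypothetical function name with an example depth of 32 layers.

```python
def assign_quant_types(n_layers: int) -> list:
    """Map each transformer block index to a quant type by depth (illustrative only)."""
    edge = n_layers // 4  # size of the first and last 25% bands
    types = []
    for i in range(n_layers):
        if i < edge or i >= n_layers - edge:
            types.append("IQ4_XS")   # outer layers keep more precision
        else:
            types.append("IQ2_XXS")  # middle layers are compressed hardest
    return types


# Embedding and output tensors are protected separately (keys are illustrative).
PROTECTED = {"token_embd": "Q5_K", "output": "Q5_K"}

plan = assign_quant_types(32)
print(plan.count("IQ4_XS"), plan.count("IQ2_XXS"))  # 16 16
```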

Quantization performance comparison (Llama-3-8B)

Quantization	Standard PPL	DynamicGate PPL	Δ PPL	Standard size	DG size	Δ size	Standard speed	DG speed
IQ2_XXS	11.30	9.84	-12.9%	2.5G	2.6G	+0.1G	234s	246s
IQ2_XS	11.72	11.63	-0.8%	2.7G	2.8G	+0.1G	242s	246s
IQ2_S	14.31	9.02	-36.9%	2.7G	2.9G	+0.2G	238s	244s
IQ1_M	27.46	15.41	-43.9%	2.2G	2.5G	+0.3G	206s	212s
IQ1_S	53.07	32.00	-39.7%	2.1G	2.4G	+0.3G	184s	209s

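Key:

PPL = perplexity (lower is better)
Δ PPL = percentage change in perplexity from standard to DynamicGate quantization
Speed = inference time (CPU AVX2, 2048-token context)
Size differences reflect the overhead of mixed quantization

Key improvements:

🔥 IQ1_M: perplexity drops by 43.9% (27.46 → 15.41)
🚀 IQ2_S: perplexity drops by 36.9% for only 0.2GB of extra size
⚡ IQ1_S: perplexity is 39.7% lower (53.07 → 32.00) despite 1-bit quantization

Trade-offs:

All variants grow modestly in size (0.1-0.3GB)
Inference time stays close for most variants (roughly 2-5% slower); IQ1_S is the exception, going from 184s to 209s (about 14% slower)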
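
The percentages above can be re-derived from the raw columns of the comparison table. The sketch below copies the perplexity and timing values from the table and computes the relative change for each quantization; the helper name `pct_change` is an illustrative choice.

```python
# (standard PPL, DynamicGate PPL, standard time in s, DynamicGate time in s), copied from the table
ROWS = {
    "IQ2_XXS": (11.30, 9.84, 234, 246),
    "IQ2_XS": (11.72, 11.63, 242, 246),
    "IQ2_S": (14.31, 9.02, 238, 244),
    "IQ1_M": (27.46, 15.41, 206, 212),
    "IQ1_S": (53.07, 32.00, 184, 209),
}


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


for name, (ppl_std, ppl_dg, t_std, t_dg) in ROWS.items():
    print(f"{name}: PPL {pct_change(ppl_std, ppl_dg):+.1f}%, time {pct_change(t_std, t_dg):+.1f}%")
```

The last digit can differ from the table where the table truncates instead of rounding; IQ2_S, for example, comes out at -37.0% here.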

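When to use these models

📌 Fitting models into GPU VRAM

✔ Memory-constrained deployments

✔ CPU and edge devices that can tolerate 1-2 bit errors

✔ Ultra-low-bit quantization research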

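📚 Documentation

Choosing the right model format

The right model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16) – use if BF16 acceleration is available

A 16-bit floating-point format designed for faster computation while retaining good precision.
Offers a dynamic range similar to FP32 with lower memory usage.
Recommended if your hardware supports BF16 acceleration (check your device specifications).
Suited to high-performance inference with a smaller memory footprint than FP32.

📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.

📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

F16 (Float 16) – more widely supported than BF16

A 16-bit floating-point format with high precision but a smaller dynamic range than BF16.
Works on most devices with FP16 acceleration (including many GPUs and some CPUs).
Its 10-bit mantissa is finer than BF16's 7 bits, but its 5-bit exponent (versus 8 in BF16) makes overflow and underflow more likely; it is generally sufficient for inference.

📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance of speed, memory usage, and accuracy.
✔ You run on a GPU or another device optimized for FP16 computation.

📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You have memory constraints.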
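
The trade-off between BF16 and F16 follows from how each format splits its 16 bits. As a small illustration in plain Python, using the standard bit layouts (FP16: 5 exponent bits and 10 mantissa bits; BF16: 8 and 7), the following computes the largest finite value and the machine epsilon of each format. The function name is hypothetical.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int) -> tuple:
    """Return (largest finite value, machine epsilon) for an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_finite = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return max_finite, epsilon


fp16_max, fp16_eps = float_format_stats(exponent_bits=5, mantissa_bits=10)
bf16_max, bf16_eps = float_format_stats(exponent_bits=8, mantissa_bits=7)
print(f"FP16: max={fp16_max:.0f}, eps={fp16_eps}")   # max=65504, eps=0.0009765625
print(f"BF16: max={bf16_max:.3e}, eps={bf16_eps}")
```

BF16 reaches values around 3.4e38, the same order as FP32, while FP16 tops out at 65504 but resolves values near 1.0 eight times more finely.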

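Quantized models (Q4_K, Q6_K, Q8, etc.) – for CPU and low-VRAM inference

Quantization reduces model size and memory usage while preserving as much accuracy as possible.

Lower-bit models (Q4_K) → best for minimizing memory usage, possibly at lower precision.
Higher-bit models (Q6_K, Q8_0) → better accuracy, but more memory required.

📌 Use quantized models if:
✔ You run inference on a CPU and need an optimized model.
✔ Your device has too little VRAM to load a full-precision model.
✔ You want a smaller memory footprint with reasonable accuracy.

📌 Avoid quantized models if:
❌ You need maximum accuracy (full-precision models are a better fit).
❌ Your hardware has enough VRAM for a higher-precision format (BF16/F16).

Very low-bit quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for extreme memory efficiency, which suits low-power devices or large-scale deployments where memory is the critical constraint.

IQ3_XS: ultra-low-bit (3-bit) quantization with extreme memory efficiency.
- Use case: ultra-low-memory devices where even Q4_K is too large.
- Trade-off: lower accuracy than higher-bit quantizations.
IQ3_S: small block size for maximum memory efficiency.
- Use case: low-memory devices where IQ3_XS is too aggressive.
IQ3_M: medium block size, with better accuracy than IQ3_S.
- Use case: low-memory devices where IQ3_S is too limiting.
Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
- Use case: low-memory devices where Q6_K is too large.
Q4_0: pure 4-bit quantization, optimized for ARM devices.
- Use case: ARM-based devices or low-memory environments.

Model format selection summary

Model format	Precision	Memory usage	Device requirements	Best use case
BF16	Highest	High	GPU/CPU with BF16 support	High-speed inference with reduced memory
F16	High	High	Devices with FP16 support	GPU inference when BF16 is unavailable
Q4_K	Medium-low	Low	CPU or low-VRAM devices	Memory-constrained environments
Q6_K	Medium	Moderate	CPU with more memory	Better accuracy among quantized models
Q8_0	High	Moderate	CPU or GPU with enough VRAM	Best accuracy among quantized models
IQ3_XS	Very low	Very low	Ultra-low-memory devices	Extreme memory efficiency at low accuracy
Q4_0	Low	Low	ARM or low-memory devices	llama.cpp can optimize for ARM devices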
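
To turn the table into a rough sizing rule, one can estimate weight size as parameters × bits per weight ÷ 8. The sketch below does this under stated assumptions: the bits-per-weight figures are nominal values for each block format (IQ3_XS is approximate), the parameter count is taken as a nominal 8 billion, and real GGUF files differ because they mix tensor types and carry metadata. The function names are hypothetical, and the budget ignores KV cache and runtime overhead.

```python
from typing import Optional

# Assumed nominal bits per weight; actual files vary.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K": 4.5,
    "Q4_0": 4.5,
    "IQ3_XS": 3.3,
}


def estimate_size_gb(n_params: float, fmt: str) -> float:
    """Rough weight size in GB (10^9 bytes) for n_params weights."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def largest_format_that_fits(n_params: float, budget_gb: float) -> Optional[str]:
    """Pick the highest bits-per-weight format whose estimate fits the budget."""
    fitting = [f for f in BITS_PER_WEIGHT if estimate_size_gb(n_params, f) <= budget_gb]
    return max(fitting, key=BITS_PER_WEIGHT.get, default=None)


print(largest_format_that_fits(8e9, budget_gb=8.0))  # Q6_K
```

With an 8GB budget the estimate for Q8_0 (8.5GB) does not fit, so the sketch falls back to Q6_K (about 6.6GB).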

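Included files and details

`granite-3.2-8b-instruct-bf16.gguf`

Model weights stored in BF16.
Use this file if you want to requantize the model into another format.
Best if your device supports BF16 acceleration.

`granite-3.2-8b-instruct-f16.gguf`

Model weights stored in F16.
Use this file if your device supports FP16, especially when BF16 is unavailable.

`granite-3.2-8b-instruct-bf16-q8_0.gguf`

Output and embedding layers are kept in BF16.
All other layers are quantized to Q8_0.
Use this file if your device supports BF16 and you want a quantized version.

`granite-3.2-8b-instruct-f16-q8_0.gguf`

Output and embedding layers are kept in F16.
All other layers are quantized to Q8_0.

`granite-3.2-8b-instruct-q4_k.gguf`

Output and embedding layers are quantized to Q8_0.
All other layers are quantized to Q4_K.
Suitable for CPU inference with limited memory.

`granite-3.2-8b-instruct-q4_k_s.gguf`

The smallest Q4_K variant; uses less memory at the cost of accuracy.
Best for very low-memory setups.

`granite-3.2-8b-instruct-q6_k.gguf`

Output and embedding layers are quantized to Q8_0.
All other layers are quantized to Q6_K.

`granite-3.2-8b-instruct-q8_0.gguf`

Fully Q8-quantized model for better accuracy.
Requires more memory but offers higher precision.

`granite-3.2-8b-instruct-iq3_xs.gguf`

IQ3_XS quantization, optimized for extreme memory efficiency.
Best for ultra-low-memory devices.

`granite-3.2-8b-instruct-iq3_m.gguf`

IQ3_M quantization, with a medium block size for better accuracy.
Suitable for low-memory devices.

`granite-3.2-8b-instruct-q4_0.gguf`

Pure Q4_0 quantization, optimized for ARM devices.
Best for low-memory environments.
Prefer IQ4_NL for better accuracy.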
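
Once one of these files is downloaded, it can be run with llama.cpp. The sketch below only builds the argument list for llama.cpp's `llama-cli` binary and prints it; nothing is executed. It assumes llama.cpp is installed and the file sits in the current directory. The helper name, prompt, token limit, and thread count are example values, not recommendations from the model card.

```python
import shlex


def llama_cli_command(model_file: str, prompt: str, n_predict: int = 256, threads: int = 6) -> list:
    """Build an argv list for llama.cpp's llama-cli (not executed here)."""
    return [
        "llama-cli",
        "-m", model_file,       # path to the GGUF file
        "-p", prompt,           # prompt text
        "-n", str(n_predict),   # maximum number of tokens to generate
        "-t", str(threads),     # number of CPU threads
    ]


cmd = llama_cli_command("granite-3.2-8b-instruct-q4_k.gguf", "Explain GGUF in one sentence.")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would launch the process; keeping it as a list avoids shell-quoting problems with prompts that contain spaces.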

🔧 Technical details

Evaluation results

Model	ArenaHard	Alpaca-Eval-2	MMLU	PopQA	TruthfulQA	BigBenchHard	DROP	GSM8K	HumanEval	HumanEval+	IFEval	AttaQ
Llama-3.1-8B-Instruct	36.43	27.22	69.15	28.79	52.79	72.66	61.48	83.24	85.32	80.15	79.10	83.43
DeepSeek-R1-Distill-Llama-8B	17.17	21.85	45.80	13.25	47.43	65.71	44.46	72.18	67.54	62.91	66.50	42.87
Qwen-2.5-7B-Instruct	25.44	30.34	74.30	18.12	63.06	70.40	54.71	84.46	93.35	89.91	74.90	81.90
DeepSeek-R1-Distill-Qwen-7B	10.36	15.35	50.72	9.94	47.14	65.04	42.76	78.47	79.89	78.43	59.10	42.45
Granite-3.1-8B-Instruct	37.58	30.34	66.77	28.7	65.84	68.55	50.78	79.15	89.63	85.79	73.20	85.73
Granite-3.1-2B-Instruct	23.3	27.17	57.11	20.55	59.79	54.46	18.68	67.55	79.45	75.26	63.59	84.7
Granite-3.2-2B-Instruct	24.86	34.51	57.18	20.56	59.8	52.27	21.12	67.02	80.13	73.39	61.55	83.23
Granite-3.2-8B-Instruct	55.25	61.19	66.79	28.04	66.92	64.77	50.95	81.65	89.35	85.72	74.31	85.42
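
To see where Granite-3.2-8B-Instruct differs from its predecessor, the two 8B rows of the table can be compared benchmark by benchmark. The snippet below copies those rows and prints the differences sorted from largest gain to largest drop; the variable names are illustrative.

```python
BENCHMARKS = ["ArenaHard", "Alpaca-Eval-2", "MMLU", "PopQA", "TruthfulQA", "BigBenchHard",
              "DROP", "GSM8K", "HumanEval", "HumanEval+", "IFEval", "AttaQ"]
# Rows copied from the evaluation table.
GRANITE_3_1_8B = [37.58, 30.34, 66.77, 28.7, 65.84, 68.55, 50.78, 79.15, 89.63, 85.79, 73.20, 85.73]
GRANITE_3_2_8B = [55.25, 61.19, 66.79, 28.04, 66.92, 64.77, 50.95, 81.65, 89.35, 85.72, 74.31, 85.42]

deltas = {
    name: round(new - old, 2)
    for name, old, new in zip(BENCHMARKS, GRANITE_3_1_8B, GRANITE_3_2_8B)
}
for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>14}: {delta:+.2f}")
```

The largest gains are on Alpaca-Eval-2 (+30.85) and ArenaHard (+17.67), while BigBenchHard shows the largest drop (-3.78); most other benchmarks move by about a point or less.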

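Training data

Overall, our training data comes largely from two key sources: (1) publicly available datasets with permissive licenses, and (2) internally generated synthetic data aimed at enhancing reasoning capabilities.

Infrastructure

We trained Granite-3.2-8B-Instruct on Blue Vela, IBM's supercomputing cluster, which is equipped with NVIDIA H100 GPUs. The cluster provides a scalable and efficient infrastructure for training our models across thousands of GPUs.

Ethical considerations and limitations

Granite-3.2-8B-Instruct builds on Granite-3.1-8B-Instruct and uses both permissively licensed open-source data and select proprietary data to improve performance. Because it inherits its foundation from the previous model, all ethical considerations and limitations that apply to Granite-3.1-8B-Instruct still apply.

📄 License

This project is licensed under Apache 2.0.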

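Other information

Testing the AI network monitoring assistant

If you find these models useful, please help test my AI-powered network monitoring assistant, which includes quantum-ready security checks: 👉 Free Network Monitor

How to test

Click the chat icon (bottom right of any page)
Choose an AI assistant type:
- TurboLLM (GPT-4-mini)
- FreeLLM (open source)
- TestLLM (experimental, CPU-only)

What I'm testing

I'm exploring the limits of small open-source models for AI network monitoring, specifically:

Function calling against live network services
How small a model can be while still handling:
- Automated Nmap scans
- Quantum-readiness checks
- Metasploit integration

TestLLM, the experimental model (llama.cpp on 6 CPU threads)

✅ Zero-configuration setup
⏳ 30-second load time (inference is slow, but there are no API costs)
🔧 Help wanted! If you're interested in edge-device AI, let's collaborate!

Other assistants

🟢 TurboLLM – uses gpt-4-mini for:
- Real-time network diagnostics
- Automated penetration testing (Nmap/Metasploit)
- 🔑 Get more tokens by downloading our free network monitoring agent
🔵 HugLLM – open-source models (about 8B parameters):
- 2x more tokens than TurboLLM
- AI-powered log analysis
- 🌐 Runs on the Hugging Face Inference API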