MiniCPM4-8B-GGUF Open-Source Large Language Model: Built for On-Device Use, with Efficiency Gains Across Multiple Dimensions


Minicpm4 8B GGUF

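Developed by Mungert

MiniCPM4-8B is an efficient large language model designed for on-device (edge) deployment. Its efficiency gains come from innovations along four dimensions: model architecture, training data, training algorithms, and the inference system.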

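Large language model

Transformers

Supports multiple languages
License: Apache-2.0
Tags: #efficient on-device inference #sparse attention for long text #extreme quantization

Downloads: 906

Release date: 6/13/2025

Model overview

MiniCPM4-8B is a large language model with 8 billion parameters, trained on 8T tokens and optimized for edge devices. It natively supports a context length of up to 32,768 tokens, which can be extended to 131,072 tokens with RoPE scaling.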

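Model features

Efficient sparse attention

Uses InfLLM v2, a trainable sparse attention mechanism. When processing 128K-token text, each token computes relevance with fewer than 5% of the tokens, which substantially reduces compute cost.

Extreme quantization

Supports BitCPM extreme ternary quantization, which compresses model parameters to ternary values and reduces bit width by 90%.

Long-context support

Natively supports a context length of 32,768 tokens, extendable to 131,072 tokens with LongRoPE.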
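The two context limits above imply a simple budget check before a request is sent. The sketch below is only an illustration: the function name `context_plan` and its return labels are hypothetical, and it assumes the budget is the sum of prompt tokens and requested output tokens.

```python
NATIVE_CONTEXT = 32_768      # native window stated in the text
EXTENDED_CONTEXT = 131_072   # LongRoPE-extended window stated in the text


def context_plan(prompt_tokens: int, max_new_tokens: int) -> str:
    """Classify a request by its total token budget (input plus output)."""
    total = prompt_tokens + max_new_tokens
    if total <= NATIVE_CONTEXT:
        return "native"      # fits without any RoPE scaling
    if total <= EXTENDED_CONTEXT:
        return "longrope"    # needs the LongRoPE factors enabled
    return "too_long"        # beyond the validated range


# The extended window is an integer multiple of the native one.
print(EXTENDED_CONTEXT // NATIVE_CONTEXT)
print(context_plan(30_000, 1_024))
print(context_plan(100_000, 4_096))
```

A caller could use the returned label to decide whether the model configuration needs the LongRoPE settings before loading.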

On-device optimization

Designed for edge devices; achieves more than 5x faster generation on typical on-device chips.

Model capabilities

Long-form text generation

Multi-turn dialogue

Knowledge-intensive tasks

Reasoning-intensive tasks

Tool calling

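Use cases

Content generation

Article writing

Generates high-quality long-form articles from user prompts

Can produce well-structured, logically coherent professional articles

Intelligent assistant

Travel recommendations

Recommends tourist attractions and describes each in detail

Can generate a detailed recommendation list covering multiple attractions

Academic research

Literature review

Autonomously generates trustworthy long-form survey papers from a user query

Can generate a well-structured academic survey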

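🚀 MiniCPM4-8B GGUF Model

The MiniCPM4-8B GGUF model is an efficient large language model designed for on-device deployment. It achieves its efficiency gains through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and the inference system.

🚀 Quick start

Model generation details

This model was generated using llama.cpp at commit 7f4fbe51.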

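Quantization beyond IMatrix

I have been experimenting with a new quantization approach that selectively raises the precision of key layers beyond what the default IMatrix configuration provides.

In my testing, standard IMatrix quantization underperforms at low bit depths, especially with Mixture of Experts (MoE) models. To address this, I use the --tensor-type option in llama.cpp to manually bump important layers to higher precision. You can see the implementation here: Layer bumping with llama.cpp

This does increase the model file size, but it significantly improves precision for a given quantization level.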

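Choosing the right GGUF model format

Click here for information on choosing the right GGUF model format

Model resources

GitHub repository | Technical report

Join our Discord and WeChat communities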

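✨ Key features

MiniCPM4 model series

The MiniCPM4 series consists of efficient large language models designed specifically for on-device deployment. This efficiency comes from systematic innovation in four key dimensions: model architecture, training data, training algorithms, and the inference system.

MiniCPM4-8B: The flagship MiniCPM4 model, with 8B parameters, trained on 8T tokens. (<-- you are viewing this model)
MiniCPM4-0.5B: The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
MiniCPM4-8B-Eagle-FRSpec: Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu: Eagle head trained with QAT for FRSpec, combining speculation and quantization to achieve further acceleration for MiniCPM4-8B.
MiniCPM4-8B-Eagle-vLLM: Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
MiniCPM4-8B-marlin-Eagle-vLLM: Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
BitCPM4-0.5B: Applies extreme ternary quantization to MiniCPM4-0.5B, compressing model parameters to ternary values and reducing bit width by 90%.
BitCPM4-1B: Applies extreme ternary quantization to MiniCPM3-1B, compressing model parameters to ternary values and reducing bit width by 90%.
MiniCPM4-Survey: Based on MiniCPM4-8B; takes a user query as input and autonomously generates a trustworthy long-form survey paper.
MiniCPM4-MCP: Based on MiniCPM4-8B; takes a user query and the available MCP tools as input and autonomously calls the relevant MCP tools to satisfy the request.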

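Model advantages

MiniCPM 4 is a highly efficient edge-side large model, optimized along four dimensions: model architecture, learning algorithms, training data, and the inference system.

Efficient model architecture:
- InfLLM v2 -- trainable sparse attention: uses a trainable sparse attention architecture in which each token computes relevance with fewer than 5% of the tokens when processing 128K-token text, substantially reducing the compute cost of long text.
Efficient learning algorithms:
- Model Wind Tunnel 2.0 -- efficient, predictable scaling: introduces a scaling prediction method for downstream task performance, enabling a more precise search over model training configurations.
- BitCPM -- extreme ternary quantization: restricts each model parameter to one of three values (ternary), achieving a 90% reduction in bit width.
- Efficient training engineering: uses FP8 low-precision computation combined with a multi-token prediction training strategy.
High-quality training data:
- UltraClean -- high-quality pre-training data filtering and generation: builds an iterative data cleaning strategy based on efficient data verification, and open-sources the high-quality Chinese and English pre-training dataset UltraFinweb.
- UltraChat v2 -- high-quality supervised fine-tuning data generation: builds large-scale, high-quality supervised fine-tuning datasets covering knowledge-intensive data, reasoning-intensive data, instruction-following data, long-text understanding data, and tool-calling data.
Efficient inference system:
- CPM.cu -- lightweight, efficient CUDA inference framework: integrates sparse attention, model quantization, and speculative sampling for efficient prefilling and decoding.
- ArkInfer -- cross-platform deployment system: supports efficient deployment across multiple backend environments, with flexible cross-platform adaptation.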
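The 90% figure quoted for BitCPM in the list above is consistent with simple arithmetic if the baseline is a 16-bit weight, which is an assumption here since the text does not state the baseline. A ternary value carries log2(3), about 1.58 bits. The sketch below checks that arithmetic and also shows a toy absmean ternarizer; that scheme is a commonly used generic recipe swapped in for illustration and is not claimed to be BitCPM's actual method. All function names are hypothetical.

```python
import math


def ternary_bits() -> float:
    """Information content of one value drawn from {-1, 0, +1}."""
    return math.log2(3)


def bit_width_reduction(baseline_bits: float = 16.0) -> float:
    """Fractional bit-width saving versus an assumed 16-bit baseline."""
    return 1.0 - ternary_bits() / baseline_bits


def ternarize(weights):
    """Toy absmean ternarization: scale by mean |w|, round, clamp to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0 for _ in weights], 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


codes, scale = ternarize([0.9, -0.05, -1.2, 0.4, 0.0])
print(round(bit_width_reduction(), 3))  # about 0.901 under the 16-bit assumption
print(codes)                            # each entry is -1, 0, or 1
```

Multiplying each code by `scale` gives the dequantized approximation of the original weight.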

📦 Installation guide

Install CPM.cu

git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install

Install infllmv2_cuda_impl

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install

Install SGLang

git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

Install vLLM

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

💻 Usage examples

Inference with CPM.cu

Enable LongRoPE by setting the rope_scaling field in config.json:
{
    ...,
    "rope_scaling": {
        "rope_type": "longrope", 
        "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "original_max_position_embeddings": 32768
    }
}

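After this change, you can run the following command to reproduce the long-context speedup (the script automatically downloads the model weights from HuggingFace):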

python3 tests/test_generate.py

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# You can use the chat interface directly
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)

# You can also use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)
output_token_ids = [
    # Drop the prompt tokens so that only newly generated tokens are decoded
    model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
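The slicing step in the listing above relies on the fact that, for decoder-only models, generate returns each prompt followed by the new tokens. The sketch below shows the same idea on plain Python lists so it can be run without a model; the helper name `strip_prompt` and the token IDs are made up for illustration.

```python
def strip_prompt(output_ids, input_ids):
    """Remove each prompt prefix from the matching generated sequence."""
    return [out[len(inp):] for out, inp in zip(output_ids, input_ids)]


# Hypothetical token IDs: the first three tokens of each output echo the prompt.
prompts = [[101, 7, 8]]
outputs = [[101, 7, 8, 42, 43]]
print(strip_prompt(outputs, prompts))
```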

Enabling InfLLM v2

MiniCPM4-8B supports InfLLM v2, a sparse attention mechanism designed for efficient long-sequence inference. It requires the infllmv2_cuda_impl library.

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install

To enable InfLLM v2, add a sparse_config field to config.json:

{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}

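These parameters control the behavior of InfLLM v2:

kernel_size (default: 32): the size of the semantic kernels.
kernel_stride (default: 16): the stride between adjacent kernels.
init_blocks (default: 1): the number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
block_size (default: 64): the block size for key-value blocks.
window_size (default: 2048): the size of the local sliding window.
topk (default: 64): each token computes attention only with the top-k most relevant key-value blocks.
use_nope (default: false): whether to use the NOPE technique in block selection to improve performance.
dense_len (default: 8192): because sparse attention offers limited benefit for short sequences, the model can use standard (dense) attention for shorter texts. It uses dense attention for sequences shorter than dense_len tokens and switches to sparse attention for sequences that exceed this length. Set this to -1 to always use sparse attention regardless of sequence length.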
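To get a feel for these defaults, the sketch below estimates how many key-value tokens a single query could attend to at a 131,072-token sequence length. It is a rough upper bound under a stated assumption: the top-k blocks, the sliding window, and the initial blocks are counted as if they never overlap. It also treats a sequence of exactly dense_len tokens as sparse, which is an assumption about the boundary case. The class and function names are hypothetical and this is not the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class SparseConfig:
    # Defaults copied from the parameter list above.
    init_blocks: int = 1
    block_size: int = 64
    window_size: int = 2048
    topk: int = 64
    dense_len: int = 8192


def uses_sparse_attention(seq_len: int, cfg: SparseConfig) -> bool:
    """Dense below dense_len, sparse otherwise; -1 forces sparse."""
    if cfg.dense_len == -1:
        return True
    return seq_len >= cfg.dense_len  # boundary handling is an assumption


def attended_upper_bound(seq_len: int, cfg: SparseConfig) -> int:
    """Upper bound on attended tokens, assuming no overlap between parts."""
    budget = (cfg.topk * cfg.block_size        # selected key-value blocks
              + cfg.window_size                # local sliding window
              + cfg.init_blocks * cfg.block_size)  # initial blocks
    return min(seq_len, budget)


cfg = SparseConfig()
seq_len = 131_072
fraction = attended_upper_bound(seq_len, cfg) / seq_len
print(attended_upper_bound(seq_len, cfg), f"{fraction:.2%}")
print(uses_sparse_attention(4_096, cfg), uses_sparse_attention(seq_len, cfg))
```

Even with this pessimistic no-overlap count, the fraction stays below the 5% figure quoted elsewhere in this document for 128K-token text.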

Inference with SGLang

git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

Start the inference server:

python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml

Use the chat interface:

import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)

Inference with vLLM

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768, 
    dtype="bfloat16", 
    gpu_memory_utilization=0.8, 
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)

outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

Start the inference server:

vllm serve openbmb/MiniCPM4-8B

Use the chat interface:

import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
    extra_body=dict(add_special_tokens=True),  # Ensure special tokens are added for the chat template
)

print(response.choices[0].message.content)

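📚 Detailed documentation

Long-text processing

MiniCPM4 natively supports context lengths of up to 32,768 tokens. For conversations whose total length (input plus output) significantly exceeds this limit, we recommend using RoPE scaling to handle long text effectively. We have validated the model's performance at context lengths of up to 131,072 tokens by modifying the LongRoPE factors.

You can apply the LongRoPE factor changes by modifying the model files. Specifically, adjust the rope_scaling field in config.json: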

{
    ...,
    "rope_scaling": {
        "rope_type": "longrope", 
        "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "original_max_position_embeddings": 32768
    }
}
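Editing config.json by hand is error-prone with 64-entry factor lists, so the change can also be scripted. The sketch below is one possible way to do it with the standard library only. The helper name `enable_longrope` is hypothetical, the equal-length check is an assumption based on the listing above, and the demo uses a throwaway file with two-element placeholder factors rather than the real values.

```python
import json
import tempfile
from pathlib import Path


def enable_longrope(config_path, long_factor, short_factor,
                    original_max_position_embeddings=32768):
    """Write a longrope rope_scaling block into a config.json file."""
    if len(long_factor) != len(short_factor):
        # Assumption: both lists have the same length, as in the listing above.
        raise ValueError("long_factor and short_factor must have equal length")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = {
        "rope_type": "longrope",
        "long_factor": list(long_factor),
        "short_factor": list(short_factor),
        "original_max_position_embeddings": original_max_position_embeddings,
    }
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Demo on a temporary file. The existing key and the factor values are
# placeholders for illustration, not the model's real configuration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"hidden_size": 4096}), encoding="utf-8")
    enable_longrope(demo_path, [1.0, 2.0], [1.0, 2.0])
    patched = json.loads(demo_path.read_text(encoding="utf-8"))

print(patched["rope_scaling"]["rope_type"])
```

In real use, the two factor lists would be the full lists shown in the listing above, and the path would point at the config.json of the downloaded model.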

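Notes on vLLM inference

⚠️ Important

In vLLM's chat API, add_special_tokens defaults to False. This means important special tokens, such as the beginning-of-sequence (BOS) token, are not added automatically. To make sure the input prompt is formatted correctly for the model, explicitly set extra_body={"add_special_tokens": True}.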
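When calling the server without the openai client, the same flag has to go into the request body directly. The sketch below builds such a request with the standard library, on the assumption that fields passed through extra_body end up as top-level keys of the JSON payload. The helper name `build_chat_request` is hypothetical, and the URL and model name mirror the examples in this document.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str):
    """Build a chat completion request that asks for special tokens."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 0.7,
        "max_tokens": 1024,
        # Assumed top-level placement, equivalent to extra_body in the client.
        "add_special_tokens": True,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1", "openbmb/MiniCPM4-8B", "Hello"
)
print(req.full_url)
print(json.loads(req.data)["add_special_tokens"])
# With a server running, send it with: urllib.request.urlopen(req)
```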