---
license: mit
tags:
- qwen
- QwQ
- FP8
- vLLM
base_model: Qwen/QwQ-32B
library_name: transformers
---
# QwQ-32B-FP8-dynamic
## Model Overview

- **Model Architecture:** Qwen2ForCausalLM
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** March 6, 2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of Qwen/QwQ-32B.
## Model Optimizations

This model was obtained by quantizing the weights and activations of Qwen/QwQ-32B to the FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.

Only the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme, and activations with a symmetric per-token scheme. Quantization was performed with llm-compressor.
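To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric scale computation and fake (quantize-dequantize) quantization. The constant 448 is the largest magnitude representable in FP8 E4M3; the rounding of each value onto the E4M3 grid is omitted, so this illustrates only the scaling and clipping:

```python
FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def symmetric_scales(matrix):
    # One symmetric scale per row: max |value| in the row divided by the
    # FP8 range. For a weight matrix a "row" is an output channel
    # (per-channel); for activations it is a token, with the scale
    # computed dynamically at runtime (per-token).
    # The `or 1.0` guards against all-zero rows.
    return [(max(abs(v) for v in row) / FP8_E4M3_MAX) or 1.0 for row in matrix]


def fake_quant(matrix, scales):
    # Simulate quantize -> dequantize: divide by the scale, clip to the
    # FP8 range, and scale back. Real FP8 additionally rounds each
    # value to the nearest E4M3 grid point (omitted here).
    out = []
    for row, s in zip(matrix, scales):
        out.append(
            [min(max(v / s, -FP8_E4M3_MAX), FP8_E4M3_MAX) * s for v in row]
        )
    return out
```

Because clipping never triggers when the scale is derived from the row's own maximum, this sketch is lossless; the accuracy cost of real FP8 comes entirely from the grid rounding that is omitted here.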
## Deployment with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/QwQ-32B-FP8-dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_name)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# apply_chat_template returns token ids here (tokenize defaults to True)
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the vLLM documentation for more details.
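As a sketch, an OpenAI-compatible server could be launched with vLLM's CLI and queried over HTTP; exact flags and defaults depend on your vLLM version and hardware:

```shell
# Launch an OpenAI-compatible server on the default port (8000)
vllm serve neuralmagic/QwQ-32B-FP8-dynamic --tensor-parallel-size 1

# Query the standard chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/QwQ-32B-FP8-dynamic",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```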
## Creation
This model was created with llm-compressor by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load the original model and tokenizer
model_stub = "Qwen/QwQ-32B"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    torch_dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Quantize all Linear layers to FP8 with dynamic per-token activation
# scales, keeping the lm_head in its original precision
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# FP8_DYNAMIC is data-free, so no calibration dataset is required
oneshot(
    model=model,
    recipe=recipe,
)

# Save the quantized model and tokenizer
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
## Accuracy
| Category  | Metric              | QwQ-32B | QwQ-32B-FP8-dynamic (this model) | Recovery |
|-----------|---------------------|---------|----------------------------------|----------|
| Reasoning | AIME 2024 (pass@1)  | 78.66   | 79.40                            | 100.94%  |
|           | MATH-500 (pass@1)   | 97.39   | 97.44                            | 100.05%  |
|           | GPQA Diamond (pass@1) | 64.72 | 63.21                            | 97.66%   |
|           | **Average**         | 80.25   | 80.05                            | 99.75%   |
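The recovery column is simply the quantized score divided by the baseline score, expressed as a percentage. A quick sanity check reproducing it from the two score columns:

```python
# (baseline, FP8) scores taken from the accuracy table above
scores = {
    "AIME 2024 (pass@1)": (78.66, 79.40),
    "MATH-500 (pass@1)": (97.39, 97.44),
    "GPQA Diamond (pass@1)": (64.72, 63.21),
    "Average": (80.25, 80.05),
}

for metric, (baseline, fp8) in scores.items():
    recovery = fp8 / baseline * 100  # percent of baseline accuracy retained
    print(f"{metric}: {recovery:.2f}%")
```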