license: mit
tags:
- deepseek
- int8
- vllm
- llmcompressor
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
library_name: transformers
DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8
Model Overview
- Model Architecture: Qwen2ForCausalLM
- Model Optimizations: INT8 weight quantization, INT8 activation quantization
- Release Date: February 4, 2025
- Version: 1.0
- Model Developers: Neural Magic
Quantized version of DeepSeek-R1-Distill-Qwen-14B.
Model Optimizations
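This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Qwen-14B to the INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, which lowers GPU memory requirements (by approximately 50%) and increases matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.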
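The roughly 50% figure follows directly from halving the bits per parameter. A back-of-the-envelope sketch, assuming about 14e9 parameters (a round number read off the "14B" in the model name, not an exact count) and counting weights only:

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

# Assumption: ~14e9 parameters, taken from the "14B" in the model name.
NUM_PARAMS = 14e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)   # 16-bit baseline
int8_gib = weight_memory_gib(NUM_PARAMS, 8)    # 8-bit quantized
reduction = 1 - int8_gib / bf16_gib

print(f"16-bit: {bf16_gib:.1f} GiB, 8-bit: {int8_gib:.1f} GiB, reduction: {reduction:.0%}")
```

In practice the saving is slightly below this estimate, because layers that are left unquantized (such as lm_head) stay at 16 bits and the quantization scales take some extra space.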
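Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme, whereas activations are quantized with a symmetric per-token scheme. Quantization is applied with the GPTQ algorithm, as implemented in the llm-compressor library.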
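To make "per-channel" and "per-token" concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization on toy data. It uses plain round-to-nearest rather than GPTQ, and all names and values are illustrative; it is not the llm-compressor implementation.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric quantization of one vector: a single scale, zero-point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

# Toy weight matrix: one row per output channel.
W = [[0.5, -1.0, 0.25],
     [0.02, 0.01, -0.04]]
# Toy activations: one row per token.
X = [[1.0, -2.0, 0.5],
     [0.1, 0.3, -0.2]]

# Per-channel: every output channel (row of W) gets its own scale.
W_q = [quantize_symmetric(row) for row in W]
# Per-token: every token (row of X) gets its own scale.
X_q = [quantize_symmetric(row) for row in X]

def int8_linear(X_q, W_q):
    """Integer dot product, rescaled by the token scale and the channel scale."""
    return [
        [sum(a * b for a, b in zip(xq, wq)) * x_scale * w_scale for wq, w_scale in W_q]
        for xq, x_scale in X_q
    ]

y_int8 = int8_linear(X_q, W_q)
y_ref = [[sum(a * b for a, b in zip(x, w)) for w in W] for x in X]
max_err = max(abs(a - b) for ra, rb in zip(y_int8, y_ref) for a, b in zip(ra, rb))
print(f"max abs error vs. float reference: {max_err:.4f}")
```

Giving each channel and each token its own scale keeps small-magnitude rows (like the second row of W here) from being crushed by a scale chosen for larger ones.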
Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template and tokenize each conversation.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

# Keep the first completion of each request.
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
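As a rough sketch of what OpenAI-compatible serving looks like from the client side, the snippet below builds a chat completion request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8`) and listens on port 8000; the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumptions: server address and port; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8"

def build_chat_request(prompt, temperature=0.6, max_tokens=256):
    """Build a POST request for the OpenAI-style /chat/completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Who are you? Please respond in pirate speak!"))` should return a single reply. The official `openai` Python client can be pointed at the same base URL instead.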
Creation
This model was created with llm-compressor by running the code snippet below.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Load the base model and tokenizer
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Render each calibration conversation with the model's chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# SmoothQuant followed by GPTQ quantization to INT8 weights and activations;
# dampening_frac is a GPTQModifier argument. lm_head is left unquantized.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.1,
    ),
]

# Apply the recipe in one shot using the calibration data
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save the quantized model and tokenizer
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
Evaluation
The model was evaluated on OpenLLM Leaderboard V1 and V2, using the following commands:
OpenLLM Leaderboard V1:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
--tasks openllm \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
OpenLLM Leaderboard V2:
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
--apply_chat_template \
--fewshot_as_multiturn \
--tasks leaderboard \
--write_out \
--batch_size auto \
--output_path output_dir \
--show_config
Accuracy
| Category | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | Recovery |
| Reasoning | AIME 2024 (pass@1) | 66.67 | 66.31 | 99.46% |
| Reasoning | MATH-500 (pass@1) | 94.66 | 94.68 | 100.02% |
| Reasoning | GPQA Diamond (pass@1) | 59.35 | 58.32 | 98.26% |
| Reasoning | Average Score | 73.56 | 73.1 | 99.37% |
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 58.79 | 57.85 | 98.4% |
| OpenLLM V1 | GSM8K (Strict-Match, 5-shot) | 87.04 | 87.79 | 100.9% |
| OpenLLM V1 | HellaSwag (Acc-Norm, 10-shot) | 81.51 | 81.04 | 99.4% |
| OpenLLM V1 | MMLU (Acc, 5-shot) | 74.46 | 74.26 | 99.7% |
| OpenLLM V1 | TruthfulQA (MC2, 0-shot) | 54.77 | 54.94 | 100.3% |
| OpenLLM V1 | Winogrande (Acc, 5-shot) | 69.38 | 70.48 | 101.6% |
| OpenLLM V1 | Average Score | 70.99 | 71.06 | 100.1% |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 42.11 | 41.62 | 98.6% |
| OpenLLM V2 | BBH (Acc-Norm, 3-shot) | 13.73 | 14.29 | --- |
| OpenLLM V2 | Math-Hard (Exact-Match, 4-shot) | 0.00 | 0.00 | --- |
| OpenLLM V2 | GPQA (Acc-Norm, 0-shot) | 35.07 | 37.22 | 106.2% |
| OpenLLM V2 | MUSR (Acc-Norm, 0-shot) | 45.14 | 43.56 | 96.5% |
| OpenLLM V2 | MMLU-Pro (Acc, 5-shot) | 34.86 | 33.63 | 96.5% |
| OpenLLM V2 | Average Score | 34.21 | 34.12 | 99.7% |
| Coding | HumanEval (pass@1) | 78.90 | 78.40 | 99.4% |
| Coding | HumanEval (pass@10) | 89.80 | 90.10 | 100.3% |
| Coding | HumanEval+ (pass@1) | 72.60 | 72.40 | 99.7% |
| Coding | HumanEval+ (pass@10) | 84.90 | 84.90 | 100.0% |
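The recovery column is consistent with the quantized score expressed as a percentage of the baseline score. A small sketch that reproduces the reasoning rows (the function name `recovery` is illustrative):

```python
def recovery(baseline: float, quantized: float):
    """Quantized score as a percentage of the baseline; None when the baseline is 0."""
    if baseline == 0:
        return None  # ratio undefined, e.g. Math-Hard at 0.00 / 0.00
    return round(quantized / baseline * 100, 2)

# (baseline, quantized) pairs copied from the reasoning rows of the table.
rows = {
    "AIME 2024 (pass@1)": (66.67, 66.31),
    "MATH-500 (pass@1)": (94.66, 94.68),
    "GPQA Diamond (pass@1)": (59.35, 58.32),
}

for name, (base, quant) in rows.items():
    print(f"{name}: {recovery(base, quant)}%")
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that benchmark, which at these margins is within ordinary run-to-run variation.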
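Inference Performance
This model achieves up to 1.6x speedup in single-stream and multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.
Benchmarking Command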
guidellm --model neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
Single-stream performance (using