DeepSeek-R1T-Chimera-GGUF open-source large language model: lower memory footprint, high performance

Deepseek R1T Chimera GGUF

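Developed by ubergarm

DeepSeek-R1T-Chimera is a high-quality large language model. Quantized with the advanced schemes provided by ik_llama.cpp, it significantly reduces memory footprint while preserving performance.

Category: large language model, other. License: MIT. Tags: #nonlinear-quantization-optimization #expert-routing-architecture #multi-GPU-support

Downloads: 206

Released: 5/14/2025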

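Model overview

This model uses an advanced nonlinear quantization method and is particularly suited to applications that need efficient text generation.

Model features

High-quality quantization

Uses the advanced quantization schemes provided by ik_llama.cpp to deliver best-in-class quality for a given memory footprint

Mixture-of-experts architecture

Uses an MoE architecture with shared-expert and routed-expert layers to improve model efficiency

Multi-GPU support

The quants are not pre-repacked, so multi-GPU users can easily offload additional layers

Model capabilities

Efficient text generation

Long-context processing

Multi-expert routed inference

Use cases

Text generation

Creative writing

Generates high-quality long-form creative text

Technical documentation generation

Automatically generates technical documents and instructions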

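🚀 `ik_llama.cpp` imatrix quantizations of tngtech/DeepSeek-R1T-Chimera

This quant collection provides high-quality quantizations of the tngtech/DeepSeek-R1T-Chimera model, with best-in-class quality for a given memory footprint. It requires specific tooling (the ik_llama.cpp fork) and offers strong text generation capability.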

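🚀 Quick start

This quant collection must be run with the ik_llama.cpp fork, which supports its advanced nonlinear state-of-the-art quants. Do not download these large files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

Note: ik_llama.cpp can also run your existing GGUF files from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.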

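✨ Key features

High-quality quantization: these quants offer best-in-class quality for a given memory footprint.
Advanced quantization support: the ik_llama.cpp fork is required to support the advanced nonlinear state-of-the-art quants.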

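📚 Detailed documentation

Quant collection

So far these are my best recipes, offering great quality at good memory-footprint breakpoints.

DeepSeek-R1T-Chimera-IQ4_KS

Note: this quant may take a long time to upload, hopefully less than a month, lol...

File size: 338.456 GiB (4.326 BPW, bits per weight)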

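Type f32: 361 tensors - norms, etc.
Type q6_0: 61 tensors - attn_k_b (not divisible by 256, so iq6_k cannot be used)
Type iq6_k: 551 tensors - attention, token embedding, output, output norm, shared experts, etc.
Type iq4_ks: 174 tensors - ffn_(down|gate|up)_exps routed experts.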
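
The figures above can be cross-checked with a little arithmetic. The sketch below assumes that BPW is the file size in bits divided by the total number of weights, that there is one attn_k_b tensor per block, and that each MoE block carries three routed-expert tensors (down, gate, up). The block ranges (0-60 overall, 3-60 for MoE) are taken from the comments in the recipe further down. Variable names are illustrative only.

```shell
#!/bin/sh
# Figures copied from the listing above.
size_gib=338.456
bpw=4.326
f32=361; q6_0=61; iq6_k=551; iq4_ks=174

# Total number of tensors across all four types.
total=$((f32 + q6_0 + iq6_k + iq4_ks))

# Weights implied by file size and bits per weight, in billions (assumed definition of BPW).
params_b=$(awk -v s="$size_gib" -v b="$bpw" \
  'BEGIN { printf "%.0f", s * 1024 * 1024 * 1024 * 8 / b / 1e9 }')

# Blocks 0-60 give 61 blocks; the recipe treats blocks 3-60 as MoE.
blocks=$((60 - 0 + 1))
moe_blocks=$((60 - 3 + 1))
routed=$((moe_blocks * 3))   # ffn_down_exps, ffn_gate_exps, ffn_up_exps per MoE block

printf 'tensors=%s params~%sB attn_k_b=%s routed=%s\n' \
  "$total" "$params_b" "$blocks" "$routed"
```

Under these assumptions the 61 q6_0 tensors line up with one attn_k_b per block, and the 174 iq4_ks tensors line up with 58 MoE blocks times three routed-expert tensors.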

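This quant is designed to take advantage of the faster iq4_ks CUDA performance and is not pre-repacked, so multi-GPU users can easily offload additional layers. If you have enough RAM to hold it, you can use -rtr to repack the remaining layers on CPU at run time for better performance, or use the offline repack tool to produce a custom solution for your exact hardware configuration.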
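
As a rough illustration of the run-time repack option, the following dry-run sketch assembles a llama-server command line. Only -rtr comes from the text above. The binary location, the model path, and every other option and value are assumptions chosen for illustration, so check them against the help output of your own ik_llama.cpp build before relying on them.

```shell
#!/bin/sh
# Hypothetical paths: point these at your own ik_llama.cpp build and downloaded file.
server=./build/bin/llama-server
model=/path/to/DeepSeek-R1T-Chimera-IQ4_KS.gguf

# -rtr is the run-time repack flag named above. The remaining options and
# their values are illustrative assumptions, not the author's settings.
# --override-tensor exps=CPU is meant to keep tensors whose names match
# "exps" (the routed experts) in CPU memory while other layers go to GPU.
set -- \
  --model "$model" \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --override-tensor exps=CPU \
  -rtr \
  --threads 16

cmd="$server $*"

# Dry run by default: print the command. Set RUN=1 to launch it.
if [ "${RUN:-0}" = "1" ]; then
  exec "$server" "$@"
else
  printf '%s\n' "$cmd"
fi
```

Printing the command first makes it easy to review the flags before committing to loading a file of this size.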

Quantization

👈 Secret recipe

#!/usr/bin/env bash

custom="
# Token embedding and output tensors
# note token_embd cannot be repacked quant type
token_embd\.weight=iq6_k
output\.weight=iq6_k
output_norm\.weight=iq6_k

# First 3 dense layers (0-2)
blk\.[0-2]\.attn_k_b.*=q6_0
blk\.[0-2]\.attn_.*=iq6_k
blk\.[0-2]\..*=iq6_k

# All attention, norm weights, and bias tensors for MoE layers (3-60)
# Except blk.*.attn_k_b.weight is not divisible by 256 and no iq6_k so go with q6_0
blk\.[3-9]\.attn_k_b.*=q6_0
blk\.[1-5][0-9]\.attn_k_b.*=q6_0
blk\.60\.attn_k_b.*=q6_0

blk\.[3-9]\.attn_.*=iq6_k
blk\.[1-5][0-9]\.attn_.*=iq6_k
blk\.60\.attn_.*=iq6_k

blk\.[3-9]\.ffn_norm\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_norm\.weight=iq6_k
blk\.60\.ffn_norm\.weight=iq6_k

blk\.[3-9]\.exp_probs_b\.bias=iq6_k
blk\.[1-5][0-9]\.exp_probs_b\.bias=iq6_k
blk\.60\.exp_probs_b\.bias=iq6_k

# Shared Experts (3-60)
blk\.[3-9]\.ffn_down_shexp\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq6_k
blk\.60\.ffn_down_shexp\.weight=iq6_k

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
blk\.60\.ffn_(gate|up)_shexp\.weight=iq6_k

# The bulk of the model size is below
# Routed Experts (3-60)
# usually ffn_down is made a bit bigger than ffn_(gate|up) but you do you
blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
blk\.60\.ffn_down_exps\.weight=iq4_ks

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera.imatrix \
    --custom-q "$custom" \
    /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-256x21B-BF16-00001-of-00030.gguf \
    /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
    IQ4_KS \
    40
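
To see what the grep and sed step in the recipe does to the rule list, here is a minimal sketch that runs the same pipeline on three illustrative rules. It uses printf instead of echo so that backslashes pass through unchanged in any POSIX shell, and it relies on the -z option, which is a GNU sed extension. The variable names rules and flat are hypothetical.

```shell
#!/bin/sh
# Three illustrative rules in the same layout as the recipe,
# including comment lines and blank lines.
rules='
# Token embedding
token_embd\.weight=iq6_k

# Routed experts
blk\.60\.ffn_down_exps\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks
'

# Drop comment lines, then treat the input as a single record (-z) so that
# runs of newlines collapse into one comma; finally trim the trailing and
# leading commas left over from the blank first and last lines.
flat=$(
  printf '%s\n' "$rules" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

printf '%s\n' "$flat"
```

The result is a single comma-separated string of regex=type pairs, which is the form the recipe passes to --custom-q.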

imatrix

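Based on some discussions about imatrix methodology, I chose the tried-and-true older approach with the default context length of 512. This is one of the first imatrix files generated with the updated imatrix calculation fix for MLA. Given the discussion there and recent CUDA speed improvements, this MLA quant uses a lower quantization level than Q8_0 on the attention tensors (iq6_k).

👈 Imatrix methodology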

wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt

numactl -N 0 -m 0 \
./build/bin/llama-imatrix \
    --verbosity 1 \
    -m /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf \
    -f calibration_data_v5_rc.txt \
    -o DeepSeek-R1T-Chimera.imatrix \
    --layer-similarity \
    --ctx-size 512 \
    --numa numactl \
    --threads 40

# Note: I actually forgot --layer-similarity in my run, otherwise I would post its output here. Sorry!