Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 open-source model: free to deploy, tuned for more helpful responses

Nvidia Llama 3.1 Nemotron 70B Instruct HF AWQ INT4

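Developed by ibnzterrell

This is an AWQ 4-bit quantized version of Llama-3.1-Nemotron-70B-Instruct, a model NVIDIA customized from Meta's Llama-3.1-70B-Instruct to improve the helpfulness of generated responses.

Large language model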

Transformers

Multilingual support #Multilingual instruction tuning #70B-parameter quantization #High-performance dialogue generation

Downloads: 206

Released: 10/24/2024

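Model overview

This is a large language model optimized to give high-quality answers. It supports multiple languages and is suited to text generation tasks.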

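Model features

High-performance quantization

Quantized from FP16 to INT4 with AutoAWQ, using the GEMM kernel, zero-point quantization, and a group size of 128, to make inference more efficient.

Multilingual support

Supports several languages, including English, German, French, and Spanish, which suits international deployments.

Reinforced alignment training

Aligned through reinforcement learning with RLHF and HelpSteer2-Preference prompts to improve the helpfulness of generated responses.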

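Model capabilities

Text generation

Multilingual support

Dialogue systems

Use cases

Dialogue systems

Intelligent customer service

Build multilingual customer-service systems that give high-quality answers.

The original model scores 85.0 on Arena Hard and 57.6 on AlpacaEval 2 LC.

Content generation

Multilingual content creation

Generate high-quality multilingual text for news articles, blog posts, and similar content.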

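🚀 Llama-3.1-Nemotron-70B-Instruct-HF AWQ Quantized Model

This project provides an AWQ 4-bit quantized version of nvidia/Llama-3.1-Nemotron-70B-Instruct-HF, a large language model that NVIDIA customized from meta-llama/Meta-Llama-3.1-70B-Instruct, released by Meta AI. The quantized model runs efficiently on specific hardware while retaining the original model's high performance.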

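🚀 Quick Start

This repository is an AWQ 4-bit quantized version of nvidia/Llama-3.1-Nemotron-70B-Instruct-HF, NVIDIA's customized version of meta-llama/Meta-Llama-3.1-70B-Instruct, which was originally released by Meta AI.

The model was quantized from FP16 to INT4 with AutoAWQ, using the GEMM kernel, zero-point quantization, and a group size of 128.

Reference hardware: Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB RAM, and 2x NVIDIA RTX 3090. Any platform that supports Llama 3.1 70B Instruct AWQ INT4 should be able to run the model.

Below is usage (inference) information for Transformers, AutoAWQ, Text Generation Inference (TGI), and vLLM, along with the details needed to reproduce the quantization.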

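✨ Key Features

Quantized model features

Efficient compression: AutoAWQ quantizes the model from FP16 to INT4, reducing its memory footprint.
Broad compatibility: runs on platforms that support Llama 3.1 70B Instruct AWQ INT4.

Original model features

High performance: strong results on several benchmarks, scoring 85.0 on Arena Hard, 57.6 on AlpacaEval 2 LC, and 8.98 on [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158).
Top ranking: as of October 1, 2024, ranked first on all three automatic alignment benchmarks; as of October 24, 2024, it has an Elo score of 1267 (±7) on the ChatBot Arena leaderboard, placing 9th overall and 26th with style control.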

📦 Installation

Transformers

To run inference with Llama 3.1 Nemotron 70B Instruct AWQ INT4, install the following packages:

pip install -q --upgrade transformers autoawq accelerate

AutoAWQ

To run inference with Llama 3.1 Nemotron 70B Instruct AWQ INT4, install the following packages:

pip install -q --upgrade transformers autoawq accelerate

Text Generation Inference (TGI)

To run text-generation-launcher, install Docker (see the installation notes) and the huggingface_hub Python package, then log in to the Hugging Face Hub:

pip install -q --upgrade huggingface_hub
huggingface-cli login

vLLM

To run vLLM, install Docker (see the installation notes).

💻 Usage Examples

Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Note: update this value to match your use case
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  device_map="auto",
  quantization_config=quantization_config
)

prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

AutoAWQ

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  device_map="auto",
)

prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

Text Generation Inference (TGI)

docker run --gpus all --shm-size 1g -ti -p 8080:80 \
  -v hf_cache:/data \
  -e MODEL_ID=ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
  -e NUM_SHARD=4 \
  -e QUANTIZE=awq \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  -e MAX_INPUT_LENGTH=4000 \
  -e MAX_TOTAL_TOKENS=4096 \
  ghcr.io/huggingface/text-generation-inference:2.2.0

Example request:

curl 0.0.0.0:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Deep Learning?"
      }
    ],
    "max_tokens": 128
  }'

vLLM

docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
  --tensor-parallel-size 4 \
  --max-model-len 4096

Example request:

curl 0.0.0.0:8000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Deep Learning?"
      }
    ],
    "max_tokens": 128
  }'
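
Both servers above expose an OpenAI-style `/v1/chat/completions` route, so the two curl calls can be reproduced from Python with the standard library alone. The sketch below is illustrative only: the helper names `build_chat_request` and `chat` are hypothetical, and the base URLs assume the port mappings used in the docker commands above (8080 for TGI, 8000 for vLLM).

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message,
                       system_message="You are a helpful assistant.",
                       max_tokens=128):
    """Build (but do not send) a POST request mirroring the curl examples."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a TGI container running, `chat("http://localhost:8080", "tgi", "What is Deep Learning?")` would send the same request as the first curl call; for vLLM, pass `"http://localhost:8000"` and the full model ID instead.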

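🔧 Technical Details

Quantization details

The model was quantized from FP16 to INT4 with AutoAWQ, using the GEMM kernel, zero-point quantization, and a group size of 128.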
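
To make "zero-point quantization with a group size of 128" concrete, here is a minimal pure-Python sketch of group-wise asymmetric round-to-nearest quantization to 4 bits. It is not AutoAWQ's implementation: AWQ additionally rescales weights using activation statistics before quantizing, and that step is omitted here. The function names are hypothetical, and the random "weights" are stand-in data.

```python
import random


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) round-to-nearest quantization of one group."""
    qmax = 2 ** w_bit - 1                          # 15 for INT4
    w_min, w_max = min(weights), max(weights)
    scale = max((w_max - w_min) / qmax, 1e-5)      # guard against constant groups
    zero = max(0, min(qmax, round(-w_min / scale)))  # integer zero point
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map 4-bit integers back to approximate floating-point weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, group_size=128, w_bit=4):
    """Quantize a weight row in independent groups of `group_size` values."""
    return [
        quantize_group(row[start:start + group_size], w_bit)
        for start in range(0, len(row), group_size)
    ]


# Stand-in data: 256 small random weights, i.e. two groups of 128.
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize_row(row)

for index, (q, scale, zero) in enumerate(groups):
    chunk = row[index * 128:(index + 1) * 128]
    restored = dequantize_group(q, scale, zero)
    max_error = max(abs(a - b) for a, b in zip(chunk, restored))
    print(f"group {index}: scale={scale:.6f} zero={zero} max_error={max_error:.6f}")
```

Each group stores its own scale and zero point, which is why a smaller group size tracks the weight distribution more closely at the cost of extra metadata.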

Reference hardware

CPU: Intel Xeon CPU E5-2699A v4 @ 2.40GHz
Memory: 256GB RAM
GPU: 2x NVIDIA RTX 3090

Reproducing the quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

# Clear the CUDA cache
torch.cuda.empty_cache()

# Memory limits: set these to match your hardware
max_memory = {0: "22GiB", 1: "22GiB", "cpu": "160GiB"}

model_path = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
quant_path = "ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4"
quant_config = {
  "zero_point": True,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM",
}

# Load the model. Note: the layers are loaded onto the CPU, but quantization
# still needs a GPU and VRAM (verify with nvidia-smi).
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    use_cache=False,
    max_memory=max_memory,
    device_map="cpu"
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(
    tokenizer,
    quant_config=quant_config
)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model quantized and saved to "{quant_path}"')
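
The `max_memory` values in the script above are hardware-specific. One way to derive them, sketched below, is to subtract a fixed headroom from each GPU's total memory. The helper `build_max_memory` and the 2 GiB headroom are assumptions chosen for illustration; with two 24 GB cards (such as the RTX 3090) they happen to reproduce the dictionary used above.

```python
def build_max_memory(gpu_total_gib, cpu_gib, gpu_headroom_gib=2):
    """Build a max_memory dict, leaving some headroom on every GPU.

    gpu_total_gib: list of per-GPU capacities (whole GiB), indexed by device.
    cpu_gib: how much CPU RAM (whole GiB) the loader may use.
    gpu_headroom_gib: GiB kept free per GPU (an arbitrary safety margin).
    """
    limits = {
        index: f"{total - gpu_headroom_gib}GiB"
        for index, total in enumerate(gpu_total_gib)
    }
    limits["cpu"] = f"{cpu_gib}GiB"
    return limits


# Two 24 GB GPUs and a 160 GiB CPU budget.
max_memory = build_max_memory([24, 24], cpu_gib=160)
print(max_memory)
```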

📄 License

This model is released under the llama3.1 license.

Notes

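⚠️ Important

This repository is an AWQ 4-bit quantized version of nvidia/Llama-3.1-Nemotron-70B-Instruct-HF.

⚠️ Important

Running inference with Llama 3.1 Nemotron 70B Instruct AWQ INT4 requires about 35 GiB of VRAM just to load the model checkpoint, not counting the KV cache or CUDA graphs, so make sure somewhat more than that is available.

⚠️ Important

Quantizing Llama 3.1 Nemotron 70B Instruct with AutoAWQ requires an instance with enough CPU RAM to hold the entire model (about 140 GiB) and an NVIDIA GPU with 40 GiB of VRAM.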
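
The memory figures in the notes above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below makes several simplifying assumptions: a nominal 70 billion parameters, every parameter quantized (real checkpoints usually keep embeddings and the output head in FP16, so they come out somewhat larger), and one FP16 scale plus one 4-bit zero point stored per group of 128 weights. The function names are hypothetical.

```python
GIB = 2 ** 30  # bytes per GiB


def awq_weight_gib(n_params, w_bit=4, group_size=128,
                   scale_bits=16, zero_bits=4):
    """Approximate size of group-quantized weights, in GiB."""
    # Each weight costs w_bit bits plus its share of the per-group metadata.
    bits_per_weight = w_bit + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / GIB


def fp16_weight_gib(n_params):
    """Size of the same weights stored in FP16 (2 bytes each), in GiB."""
    return n_params * 2 / GIB


n_params = 70e9  # nominal "70B" parameter count (assumption)
int4_gib = awq_weight_gib(n_params)
fp16_gib = fp16_weight_gib(n_params)
print(f"INT4, group size 128: ~{int4_gib:.1f} GiB")
print(f"FP16:                 ~{fp16_gib:.1f} GiB")
```

Under these assumptions the INT4 weights land in the mid-30s of GiB and the FP16 weights at roughly 140 GB in decimal units, the same order of magnitude as the figures quoted above. KV cache, activations, and CUDA graphs come on top of this.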

Usage tips

💡 Tip

When running inference, adjust parameters in the code, such as fuse_max_seq_len and max_memory, to match your hardware.
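
For `fuse_max_seq_len`, the Transformers documentation on fused AWQ modules indicates the value should cover the prompt length plus the expected generation length. A minimal sketch of choosing it is shown below. The helper name is hypothetical, and rounding up to a power of two is just a convenient convention, not a library requirement.

```python
def pick_fuse_max_seq_len(prompt_tokens, max_new_tokens):
    """Return a fuse_max_seq_len covering prompt plus generated tokens.

    Rounds the total up to the next power of two (an arbitrary choice
    that leaves some slack for slightly longer prompts).
    """
    total = prompt_tokens + max_new_tokens
    if total < 1:
        raise ValueError("total sequence length must be positive")
    return 1 << (total - 1).bit_length()


# A 100-token prompt with max_new_tokens=256 fits the default of 512;
# a 300-token prompt with the same generation budget does not.
print(pick_fuse_max_seq_len(100, 256))
print(pick_fuse_max_seq_len(300, 256))
```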

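Model information

Attribute	Details
Model type	Llama-3.1-Nemotron-70B-Instruct-HF AWQ INT4 quantized model
Training data	HelpSteer2-Preference prompts
Base model	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
Supported languages	English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
Library	transformers
Task	Text generation
Tags	llama-3.1, meta, autoawq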