Meta-Llama-3-70B-Instruct Quantized Open-Source Model: Assistant-Style Chat for English Commercial and Research Use

Meta Llama 3 70B Instruct Quantized.w8a16

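Developed by RedHatAI

A quantized version of Meta-Llama-3-70B-Instruct, intended for commercial and research use in English, that supports efficient assistant-like chat.

Large Language Model

Transformers

Language: English. Tags: INT8 quantization, English assistant, commercial and research use

Downloads: 1,035

Released: July 2, 2024

Model Overview

A quantized model based on the Meta-Llama-3 architecture. INT8 weight quantization reduces model size and GPU memory requirements. The model is intended for commercial and research use in English.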

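Model Features

INT8 quantization

The weights of the linear operators within the Transformer blocks are quantized to INT8, reducing disk size and GPU memory requirements by approximately 50%.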
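
As a rough back-of-the-envelope check of the "approximately 50%" figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. The parameter count of 70e9 is a round-number assumption taken from the "70B" in the model name. The estimate ignores the layers that stay unquantized, the per-channel scales, activations, and the KV cache, so real numbers will differ somewhat.

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: round parameter count implied by "70B" in the model name.
NUM_PARAMS = 70e9

fp16_gb = weight_size_gb(NUM_PARAMS, 16)  # original 16-bit weights
int8_gb = weight_size_gb(NUM_PARAMS, 8)   # weights after INT8 quantization
reduction = 1 - int8_gb / fp16_gb

print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"8-bit weights:  ~{int8_gb:.0f} GB")
print(f"reduction:      ~{reduction:.0%}")
```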

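Efficient deployment

Can be deployed efficiently with vLLM or Transformers, including in multi-GPU environments.

High recovery

On the OpenLLM benchmark (version 1), the quantized model recovers 98.4% of the unquantized model's average score.

Model Capabilities

Text generation

Assistant-like chat

Commercial use

Research use

Use Cases

Commercial applications

Customer service assistant

Generates English customer service replies to improve response efficiency.

Research applications

Academic research assistant

Helps researchers generate English research content or summaries.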

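🚀 Meta-Llama-3-70B-Instruct Quantized Model (w8a16)

Meta-Llama-3-70B-Instruct-quantized.w8a16 is a quantized version of Meta-Llama-3-70B-Instruct, intended for commercial and research use in English, that supports efficient assistant-like chat.

✨ Key Features

Model architecture: based on the Meta-Llama-3 architecture; both input and output are text.
Model optimization: the weights of Meta-Llama-3-70B-Instruct are quantized to INT8, reducing the number of bits per parameter from 16 to 8 and lowering disk size and GPU memory requirements by approximately 50%. Only the weights of the linear operators within the Transformer blocks are quantized, using symmetric per-channel quantization.
Intended use: commercial and research use in English, including assistant-like chat.
Out of scope: use in any manner that violates applicable laws or regulations (including trade compliance laws), and use in languages other than English.
Release date: July 2, 2024
Version: 1.0
License: Llama3
Model developer: Neural Magic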
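
To illustrate what "symmetric per-channel quantization" means, here is a minimal sketch in plain Python. It uses simple round-to-nearest with one scale per output channel (one per weight row). This is an assumption made for illustration: the model itself was produced with AutoGPTQ, whose algorithm chooses the quantized values more carefully, and the helper names below are hypothetical.

```python
import random


def quantize_row(row):
    """Symmetric INT8 quantization of one output channel (one weight row)."""
    max_abs = max(abs(x) for x in row)
    # Symmetric: zero maps to zero, and the range is [-127, 127] around it.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def quantize_per_channel(weight):
    """Quantize each row with its own scale; returns (integer rows, scales)."""
    pairs = [quantize_row(row) for row in weight]
    return [q for q, _ in pairs], [s for _, s in pairs]


def dequantize(q_rows, scales):
    """Map the integers back to approximate floating-point weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


random.seed(0)
weight = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # toy 4x8 matrix

q_rows, scales = quantize_per_channel(weight)
restored = dequantize(q_rows, scales)
max_err = max(
    abs(a - b)
    for row_a, row_b in zip(weight, restored)
    for a, b in zip(row_a, row_b)
)
print("scales per channel:", [round(s, 4) for s in scales])
print("max reconstruction error:", max_err)
```

Because every row has its own scale, a row with small weights is not forced onto the coarse grid that a row with large weights would need. The rounding error of each weight is bounded by half of its row's scale.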

The quantized model achieves an average score of 77.90 on the OpenLLM benchmark (version 1), compared with 79.18 for the unquantized model.

🚀 Quick Start

Deployment with vLLM

The model can be deployed efficiently with the vLLM backend. The following example uses 2 GPUs:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"
number_gpus = 2  # the model is sharded across this many GPUs

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Load the model with tensor parallelism across the available GPUs.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

# One prompt was submitted, so take the first completion of the first request.
generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for details.
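
As a sketch of what a request to such an OpenAI-compatible server could look like, the example below builds a chat-completions payload and posts it using only the Python standard library. The server address `http://localhost:8000/v1` and the helper names `build_chat_request` and `send_chat_request` are assumptions for illustration, and the sampling values simply mirror the vLLM example above. Nothing is sent unless `send_chat_request` is called with a server running.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"
BASE_URL = "http://localhost:8000/v1"  # assumption: a vLLM server running locally


def build_chat_request(user_message, system_message=None):
    """Build an OpenAI-style chat-completions payload (hypothetical helper)."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {
        "model": MODEL_ID,
        "messages": messages,
        "temperature": 0.6,
        "top_p": 0.9,
        "max_tokens": 256,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "Who are you?",
    system_message="You are a pirate chatbot who always responds in pirate speak!",
)
print(json.dumps(payload, indent=2))
```

With a server already running (how it is started depends on the installed vLLM version, so check the vLLM documentation), `send_chat_request(payload)` would return the assistant's reply as a string.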

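Deployment with Transformers

The model can also be used through the Transformers integration with the AutoGPTQ data format. The following example shows how to use the generate() function: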

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the layers across the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Apply the chat template and tokenize in one step.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop at either the end-of-sequence token or the end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Drop the prompt tokens and decode only the newly generated ones.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
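
To make the role of `<|eot_id|>` in `terminators` more concrete, the sketch below hand-formats a conversation in the style of the Llama 3 chat template. It is an approximation written for illustration, and `format_llama3_prompt` is a hypothetical helper; the template shipped with the tokenizer is authoritative and is what `apply_chat_template` actually applies.

```python
def format_llama3_prompt(messages, add_generation_prompt=True,
                         bos_token="<|begin_of_text|>"):
    """Approximate the Llama 3 chat format by hand (illustrative only)."""
    prompt = bos_token
    for message in messages:
        # Each turn is: role header, blank line, content, end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = format_llama3_prompt(messages)
print(prompt)
```

Every turn in the prompt ends with `<|eot_id|>`, so the model is expected to emit the same token when it finishes its own turn. Listing it as a terminator stops generation at that point.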

🔧 Technical Details

Model Creation

The model was created with the AutoGPTQ library, as shown in the following code:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

num_samples = 128   # number of calibration samples
max_seq_len = 8192  # calibration samples are truncated to this many tokens

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    # Render each calibration conversation with the model's chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

examples = [tokenizer(example["text"], padding=False, max_length=max_seq_len, truncation=True) for example in ds]

quantize_config = BaseQuantizeConfig(
    bits=8,          # 8-bit weights
    group_size=-1,   # no grouping: one scale per output channel
    desc_act=False,  # do not reorder columns by activation size
    model_file_base_name="model",
    damp_percent=0.1,
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    device_map="auto",
)

# Run GPTQ calibration on the examples, then save the quantized checkpoint.
model.quantize(examples)
model.save_pretrained("Meta-Llama-3-70B-Instruct-quantized.w8a16")

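Although this model was created with AutoGPTQ, Neural Magic is transitioning to llm-compressor, which supports several quantization schemes and models that AutoGPTQ does not support.

Model Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1) using lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, with the following command (using 8 GPUs):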

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16",tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096 \
  --tasks openllm \
  --batch_size auto

Accuracy

Open LLM Leaderboard evaluation scores

Benchmark	Meta-Llama-3-70B-Instruct	Meta-Llama-3-70B-Instruct-quantized.w8a16 (this model)	Recovery
MMLU (5-shot)	80.18	78.69	98.1%
ARC Challenge (25-shot)	72.44	71.59	98.8%
GSM-8K (5-shot, strict-match)	90.83	86.43	95.2%
Hellaswag (10-shot)	85.54	85.65	100.1%
Winogrande (5-shot)	83.19	83.11	99.9%
TruthfulQA (0-shot)	62.92	61.94	98.4%
Average	79.18	77.90	98.4%
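
The recovery column can be read as the quantized score expressed as a percentage of the unquantized score. The sketch below recomputes it from the two score columns of the table; the formula is an assumption inferred from the table's values rather than a definition quoted from the model's authors.

```python
# (unquantized score, quantized score) for each benchmark, copied from the table.
scores = {
    "MMLU (5-shot)": (80.18, 78.69),
    "ARC Challenge (25-shot)": (72.44, 71.59),
    "GSM-8K (5-shot, strict-match)": (90.83, 86.43),
    "Hellaswag (10-shot)": (85.54, 85.65),
    "Winogrande (5-shot)": (83.19, 83.11),
    "TruthfulQA (0-shot)": (62.92, 61.94),
}


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100 * quantized / baseline


for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized):.1f}%")

# The averages are taken over the six benchmarks, then compared the same way.
baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quantized_avg = sum(q for _, q in scores.values()) / len(scores)
avg_recovery = recovery(baseline_avg, quantized_avg)
print(f"Average: {baseline_avg:.2f} vs {quantized_avg:.2f} ({avg_recovery:.1f}%)")
```

A recovery above 100%, as for Hellaswag, means the quantized model scored slightly higher than the unquantized one on that task.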