Llama-4-Scout-17B-16E-Instruct quantized release: about 75% less GPU memory, multilingual text and image input

Llama 4 Scout 17B 16E Instruct Quantized.w4a16

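Developed by RedHatAI

An INT4 weight-quantized version of Llama-4-Scout-17B-16E-Instruct. It reduces GPU memory requirements by about 75% and handles multilingual tasks with text and image input.

Image-Text-to-Text

Safetensors

Supports multiple languages. License: other. #Multimodal #INT4Quantization #EnterpriseDeployment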

Downloads: 11.03k

Released: 4/25/2025

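Model Summary

This is an optimized multilingual large language model that accepts text and image input and produces text output. Its weights are quantized to INT4, which substantially reduces resource requirements.

Model Features

Efficient quantization

INT4 weight quantization reduces GPU memory requirements by about 75%, with a matching reduction of about 75% in disk space.

Multilingual support

Supports text and image tasks in 12 languages, including major Asian and European languages.

Enterprise deployment

Optimized for Red Hat's enterprise AI platforms, including RHEL AI and OpenShift AI.

Model Capabilities

Text generation

Multilingual processing

Image-text understanding

Use Cases

Content creation

Multilingual content generation

Automatically generates culturally appropriate content for users of different languages

Produces quality content efficiently in 12 languages

Enterprise applications

Enterprise knowledge Q&A

Deployed as an internal enterprise knowledge Q&A system

Responds quickly to employee queries and improves productivity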

🚀 Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

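This is a quantized model based on Llama-4-Scout-17B-16E-Instruct. It reduces GPU memory and disk space requirements, supports multiple languages, and can be deployed on several platforms.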

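🔍 Model Information

Property	Details
Library name	vllm
Supported languages	Arabic, German, English, Spanish, French, Hindi, Indonesian, Italian, Portuguese, Thai, Tagalog, Vietnamese
Base model	meta-llama/Llama-4-Scout-17B-16E-Instruct
Task type	Image-Text-to-Text
Tags	facebook, meta, pytorch, llama, llama4, neuralmagic, redhat, llmcompressor, quantized, W4A16, INT4
License	other (llama4)

🚀 Quick Start

The model can be deployed efficiently on several platforms. Detailed deployment instructions follow.

💻 Usage Examples

Deploy with vLLM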

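from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"
number_gpus = 4  # tensor-parallel degree; set to the number of GPUs to shard across

# Sampling settings: moderate randomness, at most 256 new tokens
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Loaded for convenience; the plain-string prompt below does not use it
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."

# Load the quantized checkpoint and shard it across the GPUs
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

# generate() returns one result per prompt
outputs = llm.generate(prompt, sampling_params)

# Take the first completion of the first (and only) prompt
generated_text = outputs[0].outputs[0].text
print(generated_text)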

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
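As a rough illustration of the OpenAI-compatible mode, the sketch below builds and posts a chat completion request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve`) and is reachable at `http://localhost:8000`. That address and the helper names `build_chat_payload` and `post_chat` are illustrative assumptions, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"


def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build a request body for the /v1/chat/completions endpoint."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": max_tokens,
    }


def post_chat(base_url: str, payload: dict) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server; the URL is an assumption):
# print(post_chat("http://localhost:8000", build_chat_payload("Hello!")))
```

The sampling values mirror the `SamplingParams` used in the offline example, so both paths should behave comparably.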

Deploy on Red Hat AI Inference Server

$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
 --ipc=host \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
--name=vllm \
registry.access.redhat.com/rhaiis/rh-vllm-cuda \
vllm serve \
--tensor-parallel-size 8 \
--max-model-len 32768  \
--enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

See the Red Hat AI Inference Server documentation for more details.
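Because the model accepts image input as well as text, a request to the served endpoint can carry both. The sketch below only builds such a request body, using the OpenAI-style `image_url` content part. The helper names and the example bytes are hypothetical, and it assumes the server exposes the chat completions endpoint on port 8000 as in the `podman` command above.

```python
import base64

MODEL_ID = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_question(question: str, image_url: str, max_tokens: int = 128) -> dict:
    """Build a chat request whose user message mixes text and one image."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real image file read from disk.
payload = build_image_question(
    "What does this chart show?",
    image_to_data_url(b"abc"),
)
```

The resulting dictionary can be sent as JSON to `/v1/chat/completions` on the server. A plain `https://` image URL can be passed in place of the data URL if the server is able to fetch it.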

Deploy on Red Hat Enterprise Linux AI

# Download the model from the Red Hat registry via docker
# Note: unless --model-dir is specified, the model is downloaded to ~/.cache/instructlab/models
ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-scout-17b-16e-instruct-quantized-w4a16:1.5

# Serve the model with ilab
ilab model serve --model-path ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-quantized-w4a16

# Chat with the model
ilab model chat --model ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-quantized-w4a16

See the Red Hat Enterprise Linux AI documentation for more details.

Deploy on Red Hat OpenShift AI

# Set up the vllm server with a ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
 name: vllm-cuda-runtime # optional: change to set a unique name
 annotations:
   openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
   opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
 labels:
   opendatahub.io/dashboard: 'true'
spec:
 annotations:
   prometheus.io/port: '8080'
   prometheus.io/path: '/metrics'
 multiModel: false
 supportedModelFormats:
   - autoSelect: true
     name: vLLM
 containers:
   - name: kserve-container
     image: quay.io/modh/vllm:rhoai-2.20-cuda # change if needed; for AMD use quay.io/modh/vllm:rhoai-2.20-rocm
     command:
       - python
       - -m
       - vllm.entrypoints.openai.api_server
     args:
       - "--port=8080"
       - "--model=/mnt/models"
       - "--served-model-name={{.Name}}"
     env:
       - name: HF_HOME
         value: /tmp/hf_home
     ports:
       - containerPort: 8080
         protocol: TCP

# Attach the model to the vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 # optional: change
    serving.kserve.io/deploymentMode: RawDeployment
  name: Llama-4-Scout-17B-16E-Instruct-quantized.w4a16          # model name; this value is used to call the model in the request payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'  # model-specific
          memory: 8Gi  # model-specific
          nvidia.com/gpu: '1'  # accelerator-specific
        requests:  # the same applies to this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime  # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct-quantized-w4a16:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

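# Make sure you are in the project where the model will be deployed
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml

# Replace <inference-service-name> and <cluster-ingress-domain> below
# - If unsure, run `oc get inferenceservice` to find the URL

# Call the server with curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \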
        -H "Content-Type: application/json" \
        -d '{
    "model": "Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'

See the Red Hat OpenShift AI documentation for more details.
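The curl request above sets `"stream": true` with `include_usage`, so the reply arrives as a series of server-sent `data:` lines rather than one JSON document. The sketch below shows one way such lines might be reassembled on the client. The function name `parse_stream` and the sample lines are hand-written for illustration and are not captured server output.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Collect generated text and the usage block from streamed SSE lines."""
    text_parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]  # sent when include_usage is true
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                text_parts.append(piece)
    return "".join(text_parts), usage


# Hand-written sample in the OpenAI-style streaming shape (illustrative only).
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Bees"}}]}',
    "",
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 1, "total_tokens": 13}}',
    "",
    "data: [DONE]",
]
text, usage = parse_stream(sample_lines)
```

Note that the request sets `"max_tokens": 1`, so a real reply would contain a single token; raise that value to get a full answer.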

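🔧 Technical Details

Model Overview

Model architecture: Llama4ForConditionalGeneration
- Input: text / image
- Output: text
Model optimizations:
- Activation quantization: none
- Weight quantization: INT4
Release date: April 25, 2025
Version: 1.0
Model developers: Red Hat (Neural Magic)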

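Model Optimizations

This model was obtained by quantizing the weights of Llama-4-Scout-17B-16E-Instruct to the INT4 data type. The optimization reduces the number of bits used to represent each weight from 16 to 4, which cuts GPU memory requirements by approximately 75% and disk space requirements by approximately 75% as well. The weights were quantized with the llm-compressor library.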
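The 75% figure follows directly from the bit widths: 4 bits is one quarter of 16 bits. The short sketch below works through that arithmetic. The parameter count of 100 billion is a round illustrative number, not the model's actual size, and the calculation ignores quantization metadata such as scales and any layers kept at higher precision, so real savings will be somewhat smaller than the ideal figure.

```python
def weight_storage_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def relative_reduction(original_bits: int, quantized_bits: int) -> float:
    """Fraction of storage saved when going from one bit width to another."""
    return 1 - quantized_bits / original_bits


EXAMPLE_PARAMS = 100e9  # hypothetical round number, for illustration only

bf16_gb = weight_storage_gb(EXAMPLE_PARAMS, 16)
int4_gb = weight_storage_gb(EXAMPLE_PARAMS, 4)
saved = relative_reduction(16, 4)

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"INT4 weights:   {int4_gb:.0f} GB")
print(f"Reduction:      {saved:.0%}")
```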

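📊 Evaluation

This model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long-context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations were run with lm-evaluation-harness.

Evaluation Details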

OpenLLM v1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

OpenLLM v2

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.5,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

Long Context RULER

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=524288,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --batch_size auto

Multimodal MMMU

lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks mmmu_val \
  --apply_chat_template \
  --batch_size auto

Multimodal ChartQA

export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks chartqa \
  --apply_chat_template \
  --batch_size auto
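The commands above differ mainly in a few `--model_args` values (`add_bos_token`, `max_model_len`, `gpu_memory_utilization`) and in the task name. A small helper can make those differences easier to see and less error-prone than editing one long comma-separated string by hand. This is only a convenience sketch: `format_model_args` and `build_lm_eval_command` are hypothetical names, and the command is assembled but not executed.

```python
import shlex

MODEL_ID = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"


def format_model_args(args: dict) -> str:
    """Join key/value pairs into the comma-separated --model_args string."""
    return ",".join(f"{key}={value}" for key, value in args.items())


def build_lm_eval_command(backend: str, model_args: dict, tasks: str, extra_flags=()) -> list:
    """Assemble the lm_eval argument list without running it."""
    return [
        "lm_eval",
        "--model", backend,
        "--model_args", format_model_args(model_args),
        "--tasks", tasks,
        *extra_flags,
        "--batch_size", "auto",
    ]


# Values copied from the OpenLLM v1 command above.
openllm_v1_args = {
    "pretrained": MODEL_ID,
    "dtype": "auto",
    "add_bos_token": True,
    "max_model_len": 4096,
    "tensor_parallel_size": 8,
    "gpu_memory_utilization": 0.7,
    "enable_chunked_prefill": True,
    "trust_remote_code": True,
}

command = build_lm_eval_command("vllm", openllm_v1_args, "openllm")
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run` on a machine where lm-evaluation-harness and vLLM are installed.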

Accuracy

Benchmark	Recovery (%)	meta-llama/Llama-4-Scout-17B-16E-Instruct	RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (this model)
ARC-Challenge 25-shot	98.51	69.37	68.34
GSM8k 5-shot	100.4	90.45	90.90
HellaSwag 10-shot	99.67	85.23	84.95
MMLU 5-shot	99.75	80.54	80.34
TruthfulQA 0-shot	99.82	61.41	61.30
WinoGrande 5-shot	98.98	77.90	77.11
OpenLLM v1 average score	99.59	77.48	77.16
IFEval 0-shot (average of instruction-level and prompt-level accuracy)	99.51	86.90	86.47
Big Bench Hard 3-shot	99.46	65.13	64.78
Math Lvl 5 4-shot	99.22	57.78	57.33
GPQA 0-shot	100.0	31.88	31.88
MuSR 0-shot	100.9	42.20	42.59
MMLU-Pro 5-shot	98.67	55.70	54.96
OpenLLM v2 average score	99.54	56.60	56.34
MMMU 0-shot	100.6	53.44	53.78
ChartQA 0-shot exact match	100.1	65.88	66.00
ChartQA 0-shot relaxed accuracy	99.55	88.92	88.52
Multimodal average score	100.0	69.41	69.43
RULER seq. length = 131072, niah_multikey_1	98.41	88.20	86.80
RULER seq. length = 131072, niah_multikey_2	94.73	83.60	79.20
RULER seq. length = 131072, niah_multikey_3	96.44	78.80	76.00
RULER seq. length = 131072, niah_multiquery	98.79	95.40	94.25
RULER seq. length = 131072, niah_multivalue	101.6	73.75	74.95
RULER seq. length = 131072, niah_single_1	100.0	100.00	100.0
RULER seq. length = 131072, niah_single_2	100.0	99.80	99.80
RULER seq. length = 131072, niah_single_3	100.2	99.80	100.0
RULER seq. length = 131072, ruler_cwe	87.39	39.42	33.14
RULER seq. length = 131072, ruler_fwe	98.13	92.93	91.20
RULER seq. length = 131072, ruler_qa_hotpot	100.4	48.20	48.40
RULER seq. length = 131072, ruler_qa_squad	96.22	53.57	51.55
RULER seq. length = 131072, ruler_qa_vt	98.82	92.28	91.20
RULER seq. length = 131072, average score	98.16	80.44	78.96
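The recovery column appears to be the quantized model's score divided by the baseline score, expressed as a percentage. That reading is inferred from the numbers in the table rather than stated by the source. The sketch below recomputes it for a few rows; because it starts from the rounded scores shown here, the last digit may differ slightly from the published value.

```python
def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


# (baseline, quantized) pairs copied from the table above.
scores = {
    "MMLU 5-shot": (80.54, 80.34),
    "GPQA 0-shot": (31.88, 31.88),
    "MuSR 0-shot": (42.20, 42.59),
    "RULER ruler_cwe": (39.42, 33.14),
}

for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery_percent(baseline, quantized):.2f}%")
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that benchmark, which is within normal run-to-run variation for small differences.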