DeepSeek LLM 7B Base AWQ: Open-Source Large Language Model, Free to Deploy for Efficient Inference and Question Answering

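DeepSeek LLM 7B Base AWQ

Quantized and published by TheBloke; original model by DeepSeek

DeepSeek LLM 7B Base is a 7B-parameter base large language model; this release applies AWQ quantization to make inference more efficient.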

Large language model

Transformers

License: Other #4-bit quantized inference #Efficient Transformer #Long-text processing

Downloads: 1,863

Released: 11/29/2023

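Model overview

A 7B-parameter base language model developed by DeepSeek. This release supports efficient 4-bit quantized inference and suits a range of text-generation tasks.

Model features

Efficient quantization

Uses AWQ 4-bit quantization to speed up inference substantially while keeping output quality close to that of the original model

Context length

Supports a context length of up to 4096 tokens

Multi-platform compatibility

Works with text-generation-webui, vLLM, Hugging Face TGI, and other inference platforms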

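Model capabilities

Text generation

Question answering

Content creation

Code generation

Use cases

Content creation

Story writing

Generate coherent short stories or novel chapters

Can produce narrative text that is logically consistent and uniform in style

Question answering

Knowledge Q&A

Answer knowledge questions posed by users

Can produce context-relevant answers; as a base model it is not instruction-tuned, so accuracy depends on how the prompt is framed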

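🚀 DeepSeek LLM 7B Base - AWQ

This repository contains AWQ model files for DeepSeek's DeepSeek LLM 7B Base. The files were quantized using hardware kindly provided by Massed Compute.

🚀 Quick start

Model information

Property	Details
Model creator	DeepSeek
Original model	DeepSeek LLM 7B Base

Available repositories

Prompt template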

{prompt}
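
Because this is a base model, the template is the raw prompt with no system text or role markers. A minimal sketch of applying it follows; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, not part of the model card.

```python
# The model card's template is the bare prompt: no system text, no role tags.
PROMPT_TEMPLATE = "{prompt}"


def build_prompt(user_text: str) -> str:
    """Fill the template with str.format (hypothetical helper)."""
    # Only the template is parsed for placeholders, so braces inside
    # user_text are passed through unchanged.
    return PROMPT_TEMPLATE.format(prompt=user_text)


if __name__ == "__main__":
    print(build_prompt("Tell me about AI"))
```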

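✨ Key features

About AWQ

AWQ is an efficient, accurate, and very fast low-bit weight quantization method that currently supports 4-bit quantization. Compared with GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

It is supported by:

Text Generation Webui - using Loader: AutoAWQ
vLLM - Llama and Mistral models only
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoAWQ - for use from Python code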

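📦 Installation guide

Provided files and AWQ parameters

Only 128g GEMM models are currently released. Adding group size 32 models and GEMV kernel models is under active consideration.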
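
Group size trades accuracy against storage: each group of weights carries its own scale and zero point, so smaller groups mean more metadata. The sketch below estimates that overhead, assuming one 16-bit scale and one 4-bit zero point per group. Those two figures are an assumption about the storage layout, not numbers taken from the model card.

```python
def effective_bits_per_weight(weight_bits: int, group_size: int,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Average storage cost per weight, including per-group metadata.

    scale_bits and zero_bits are assumed values used for illustration.
    """
    return weight_bits + (scale_bits + zero_bits) / group_size


if __name__ == "__main__":
    for group_size in (128, 32):
        bits = effective_bits_per_weight(4, group_size)
        print(f"group size {group_size}: {bits} bits per weight")
```

Under these assumptions, moving from group size 128 to group size 32 raises the cost from about 4.16 to about 4.63 bits per weight.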

Models are released as sharded safetensors files.

Branch	Bits	Group size	AWQ dataset	Sequence length	Size
main	4	128	VMware Open Instruct	4096	4.83 GB
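
To make the Bits and Group size columns concrete, here is a minimal sketch of group-wise 4-bit round-to-nearest quantization in pure Python. It is not the AWQ algorithm itself, which also rescales weights using activation statistics before quantizing, and the function names are illustrative.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats.

    Returns (codes, scale, zero_point); codes are ints in [0, 2**bits - 1].
    zero_point is left unclamped for simplicity, so it only fits in `bits`
    bits when the group contains both non-positive and non-negative values.
    """
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant groups
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantize each independently."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]
```

Each group stores one scale and one zero point, and the reconstruction error per weight is bounded by about half the group's scale.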

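How to easily download and use this model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. Using the text-generation-webui one-click installers is strongly recommended unless you are sure you know how to do a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter TheBloke/deepseek-llm-7B-base-AWQ.
3. Click Download.
4. The model will start downloading. Once it has finished, it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: deepseek-llm-7B-base-AWQ
7. Select Loader: AutoAWQ.
8. Click Load. The model will load and is then ready for use.
9. If you want any custom settings, set them, then click Save settings for this model in the top right, followed by Reload the Model.
10. Once you are ready, click the Text Generation tab and enter a prompt to get started!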

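Multi-user inference server: vLLM

See the vLLM documentation for installation and usage instructions.

Make sure you are using vLLM version 0.2 or later.
When using vLLM as a server, pass the --quantization awq parameter.

For example: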

python3 -m vllm.entrypoints.api_server --model TheBloke/deepseek-llm-7B-base-AWQ --quantization awq --dtype auto
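
Once the server is running, a client can send prompts over HTTP. The sketch below assumes the server listens on `http://localhost:8000` and exposes a `/generate` route that accepts a JSON body with a `prompt` field plus sampling parameters and replies with a `text` list. Check both assumptions against your vLLM version. The helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           base_url: str = "http://localhost:8000",
                           **sampling) -> urllib.request.Request:
    """Build a POST request for the server's /generate route.

    base_url is an assumed address; extra keyword arguments are sent as
    sampling parameters (for example temperature, top_p, max_tokens).
    """
    payload = {"prompt": prompt, **sampling}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **sampling) -> list:
    """Send the request and return the list stored under the "text" key."""
    request = build_generate_request(prompt, **sampling)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

Calling `generate("Tell me about AI", temperature=0.8, top_p=0.95, max_tokens=128)` would then return the completions as a list of strings, provided the server matches the assumptions above.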

Using vLLM from Python code

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain string, not an f-string: {prompt} is filled in below by .format()
prompt_template = '''{prompt}
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/deepseek-llm-7B-base-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
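
`SamplingParams(temperature=0.8, top_p=0.95)` controls how each next token is drawn. The toy sketch below shows what those two settings do to a probability distribution. It illustrates the general technique in pure Python and is not vLLM's implementation.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p,
    then renormalize. Returns {token_index: probability}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        mass += p
        if mass >= top_p:
            break
    return {index: p / mass for index, p in kept}
```

A sampler would draw the next token from the filtered distribution, for example with `random.choices(list(d), weights=list(d.values()))`.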

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/deepseek-llm-7B-base-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
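
The limits in these parameters are related: `--max-total-tokens` caps prompt tokens plus generated tokens, so a prompt at the full input limit of 3696 leaves 4096 - 3696 = 400 tokens for generation. A small sketch of that arithmetic, with a hypothetical helper name:

```python
MAX_INPUT_LENGTH = 3696   # value passed to --max-input-length
MAX_TOTAL_TOKENS = 4096   # value passed to --max-total-tokens


def max_new_tokens_allowed(prompt_tokens: int) -> int:
    """Largest max_new_tokens that still fits within the server limits."""
    if prompt_tokens > MAX_INPUT_LENGTH:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens; the limit is {MAX_INPUT_LENGTH}"
        )
    return MAX_TOTAL_TOKENS - prompt_tokens
```

Requests that ask for more new tokens than this budget allows can be expected to fail validation on the server.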

Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):

pip3 install "huggingface-hub>=0.17.0"

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

client = InferenceClient(endpoint_url)
# Send the templated prompt rather than the raw one
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output:", response)

Inference from Python code using Transformers

Install the necessary packages

Requires Transformers 4.35.0 or later.
Requires AutoAWQ 0.1.6 or later.

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

Note: if you are using PyTorch 2.0.1, the AutoAWQ command above will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and want to stay on PyTorch 2.0.1, run this command instead:

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
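
The wheel filename encodes its target: `cp310` means CPython 3.10 and `linux_x86_64` means 64-bit Linux on x86. A quick pre-install check can be sketched with the standard library. The function name is illustrative, and the check covers neither the Python implementation nor the installed CUDA version.

```python
import platform
import sys


def matches_cu118_wheel(version=None, system=None, machine=None) -> bool:
    """True if the Python version, OS, and CPU match the wheel's tags.

    Arguments default to the running interpreter; pass explicit values to
    check another environment.
    """
    version = tuple(version or sys.version_info[:2])
    system = system or platform.system()
    machine = machine or platform.machine()
    return version == (3, 10) and system == "Linux" and machine == "x86_64"


if __name__ == "__main__":
    print("wheel matches this interpreter:", matches_cu118_wheel())
```

If the check fails, installing from source is the safer route.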

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

Transformers example code (requires Transformers 4.35.0 and later)

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/deepseek-llm-7B-base-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
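
As the comment in the example notes, `model.generate` without a streamer returns the prompt tokens followed by the new tokens. A small sketch of keeping only the newly generated part follows; the helper name is hypothetical and the token ids are made up.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the newly generated token ids.

    Works on any sliceable sequence, including a 1-D tensor row such as
    generation_output[0]; prompt_length would then be tokens.shape[1].
    """
    return output_ids[prompt_length:]


if __name__ == "__main__":
    # Toy ids standing in for real tokenizer output.
    prompt_ids = [1, 4521, 502]
    output_ids = prompt_ids + [892, 17, 2]
    print(strip_prompt(output_ids, len(prompt_ids)))
```

In the example above this would look like `tokenizer.decode(strip_prompt(generation_output[0], tokens.shape[1]), skip_special_tokens=True)`.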

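💻 Usage examples

Basic usage

This example loads the original, unquantized deepseek-ai/deepseek-llm-7b-base checkpoint in bfloat16 for text completion.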

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-llm-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

📚 Detailed documentation

Compatibility

The provided files are tested to work with:

text-generation-webui using Loader: AutoAWQ
vLLM version 0.2.0 and later
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later
Transformers version 4.35.0 and later
AutoAWQ version 0.1.1 and later
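
A quick way to compare an environment against these minimums is sketched below with the standard library. The version parser is deliberately simplified, and `packaging.version` is the more robust choice when it is available. TGI is omitted because it runs as a server rather than a Python package, and the distribution names in `MINIMUM_VERSIONS` are assumed to match the names used on PyPI.

```python
import re
from importlib import metadata

# Minimum versions taken from the compatibility list above.
MINIMUM_VERSIONS = {"vllm": "0.2.0", "transformers": "4.35.0", "autoawq": "0.1.1"}


def parse(version: str) -> tuple:
    """Turn '4.35.0' into (4, 35, 0); local tags such as '+cu118' are dropped."""
    numbers = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = parse(installed), parse(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed() -> dict:
    """Report each package as 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report


if __name__ == "__main__":
    print(check_installed())
```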

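📄 License

This code repository is licensed under the MIT License. Use of the DeepSeek LLM models is subject to the Model License. DeepSeek LLM supports commercial use.

See LICENSE-MODEL for more details.

🔗 Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.

💬 Discord

For further support, and for discussion of these models and AI in general, join TheBloke AI's Discord server.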

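🙏 Thanks, and how to contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

A lot of people have asked whether they can contribute. I enjoy providing models and helping people, and would love to spend even more time doing it, as well as expanding into new projects such as fine-tuning and training.

If you are able and willing to contribute, it will be most gratefully received and will help me keep providing more models and start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, and other benefits.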

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Aemon Algiz.

Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius

Thank you to all the generous patrons and donors! And thank you again to a16z for their generous grant.