Yarn-Mistral-7B-128k-AWQ: Open-Source Language Model with a 128k-Token Context Window

Yarn Mistral 7B 128k AWQ

Quantized by TheBloke (original model by NousResearch)

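Yarn Mistral 7B 128K is a language model optimized for long contexts. It was further pretrained on long-context data using the YaRN extension method and supports a 128k-token context window.

Large Language Model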

Transformers

Language: English | License: Apache-2.0 | #128k-long-context #efficient-inference #English-text-generation

Downloads: 483

Released: 11/2/2023

Model Overview

A language model extended from Mistral-7B-v0.1 and optimized for long contexts, suited to natural language processing tasks that involve very long texts.

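Model Features

Very long context support

Supports a 128k-token context window for processing very long texts.

Efficient quantization

Provides an AWQ-quantized version that improves inference efficiency while largely preserving quality.

Optimized pretraining

Further pretrained on long-context data for 1,500 steps using the YaRN method.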
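As a rough illustration of what the 128k window means in practice, the sketch below computes how much larger it is than the 8k window of the base Mistral-7B-v0.1, and checks whether a prompt plus its generation budget fits. It assumes that "8k" and "128k" mean 8 × 1024 and 128 × 1024 tokens. The helper name `fits_in_context` is hypothetical and not part of any library.

```python
BASE_CONTEXT = 8 * 1024        # assumed size of the base model's "8k" window
EXTENDED_CONTEXT = 128 * 1024  # assumed size of this model's "128k" window

def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context_window: int = EXTENDED_CONTEXT) -> bool:
    """Return True if the prompt plus the generated tokens fit in the window."""
    return prompt_tokens + max_new_tokens <= context_window

# How many times longer the extended window is than the base window
extension_ratio = EXTENDED_CONTEXT // BASE_CONTEXT
print(extension_ratio)                 # 16
print(fits_in_context(100_000, 512))   # True
print(fits_in_context(131_000, 512))   # False
```

Token counts depend on the tokenizer, so in real use the prompt length should come from tokenizing the actual text.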

Model Capabilities

Long-text generation

Context understanding

Text continuation

Question answering

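Use Cases

Document processing

Long-document summarization

Summarize very long documents and extract key information.

Legal document analysis

Process and analyze complex legal contracts and clauses.

Code processing

Codebase analysis

Understand the structure and functionality of large codebases.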

🚀 Yarn Mistral 7B 128K - AWQ

Yarn Mistral 7B 128K - AWQ provides AWQ-quantized model files for NousResearch's Yarn Mistral 7B 128K. AWQ is an efficient, accurate quantization method with fast inference, and it is supported by several inference tools.

🚀 Quick Start

Model information

Property	Details
Model creator	NousResearch
Original model	Yarn Mistral 7B 128K
Model type	Mistral
Training data	emozilla/yarn-train-tokenized-16k-mistral
License	apache-2.0
Evaluation metric	Perplexity
Quantized by	TheBloke
Prompt template	`{prompt}`

Prompt template

{prompt}

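✨ Key Features

About AWQ

AWQ is an efficient, accurate, and very fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

AWQ is supported by:

Text Generation Webui - using Loader: AutoAWQ
vLLM - Llama and Mistral models only
Hugging Face Text Generation Inference (TGI)
AutoAWQ - for use from Python code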
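To make "4-bit weight quantization" more concrete, here is a minimal sketch of group-wise round-to-nearest quantization in plain Python. It is a deliberate simplification: AWQ additionally uses activation statistics to rescale important weight channels before quantizing, and that step is omitted here. The function names and the sample weights are hypothetical and are not the AutoAWQ API.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group to `bits`-bit integer codes."""
    levels = (1 << bits) - 1                # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0       # constant group: any nonzero scale works
    codes = [min(levels, round((w - lo) / scale)) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

def quantize(weights, group_size=128, bits=4):
    """Quantize a flat weight list in independent groups of `group_size`."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

# Made-up weights in [-1, 1], for illustration only
weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(256)]
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), max_err)
```

Each group stores its own scale and offset, so a smaller group size tracks the weights more closely at the cost of more metadata per weight.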

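📦 Installation Guide

Using the model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. Using the text-generation-webui one-click installers is strongly recommended unless you are sure you know how to do a manual install.

Click the Model tab.
Under Download custom model or LoRA, enter TheBloke/Yarn-Mistral-7B-128k-AWQ.
Click Download.
The model will start downloading. Once it has finished, it will say "Done".
In the top left, click the refresh icon next to Model.
In the Model dropdown, choose the model you just downloaded: Yarn-Mistral-7B-128k-AWQ.
Select Loader: AutoAWQ.
Click Load, and the model will load and be ready for use.
If you want any custom settings, set them, then click Save settings for this model followed by Reload the Model in the top right.
Once you're ready, click the Text Generation tab and enter a prompt to get started!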

Inference from Python code using AutoAWQ

Install the AutoAWQ package

Requires AutoAWQ 0.1.1 or later.

pip3 install autoawq

If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

💻 Usage Examples

Basic usage

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Yarn-Mistral-7B-128k-AWQ"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=True, safetensors=True)

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

print("*** Running model.generate:")

token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Get the output tokens, decode them, and print the text
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)

Advanced usage

Multi-user inference serving with vLLM

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain (non-f) string: the {prompt} placeholder is filled in by .format() below
prompt_template = '''{prompt}
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/Yarn-Mistral-7B-128k-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Multi-user inference serving with Hugging Face Text Generation Inference (TGI)

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

client = InferenceClient(endpoint_url)
# Send the templated prompt rather than the raw prompt
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)

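📚 Detailed Documentation

Provided files and AWQ parameters

For the first release of AWQ models, only 128g models are provided. 32g models may be added if there is interest, once perplexity and evaluation comparisons have been done. At present, 32g models have not been fully tested with AutoAWQ and vLLM.

Models are released as sharded safetensors files.

Branch	Bits	Group size (GS)	AWQ dataset	Sequence length	Size
main	4	128	wikitext	4096	4.15 GB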
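As a back-of-the-envelope check on the size in the table, the sketch below estimates the on-disk footprint of a 4-bit, group-size-128 quantization. The layer dimensions are those of the public Mistral-7B-v0.1 configuration. The storage layout (one 16-bit scale and one 4-bit zero point per group, with embeddings and the output head left in 16-bit) is an assumption for illustration, not something this page states.

```python
# Mistral-7B-v0.1 dimensions (from its public configuration)
HIDDEN, LAYERS, INTERMEDIATE = 4096, 32, 14336
KV_DIM, VOCAB = 1024, 32000        # 8 key/value heads x 128 head dim
BITS, GROUP_SIZE = 4, 128          # from the table above

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q, o and k, v projections
mlp = 3 * HIDDEN * INTERMEDIATE                    # gate, up, down projections
linear_params = LAYERS * (attn + mlp)

# Assumed per-group overhead: one 16-bit scale plus one 4-bit zero point
bits_per_weight = BITS + (16 + BITS) / GROUP_SIZE
quantized_bytes = linear_params * bits_per_weight / 8

# Assumed: input embeddings and output head stay in 16-bit (2 bytes each)
embedding_bytes = 2 * VOCAB * HIDDEN * 2

size_gb = (quantized_bytes + embedding_bytes) / 1e9
print(f"{size_gb:.2f} GB")
```

Under these assumptions the estimate comes out at about 4.15 GB, in line with the table. Small tensors such as normalization weights are ignored.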

Compatibility

The provided files are tested to work with:

text-generation-webui, using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
AutoAWQ version 0.1.1 and later.

🔧 Technical Details

Original model benchmarks

Long-context benchmarks

Model	Context window	8k perplexity	16k perplexity	32k perplexity	64k perplexity	128k perplexity
Mistral-7B-v0.1	8k	2.96	-	-	-	-
Yarn-Mistral-7b-64k	64k	3.04	2.65	2.44	2.20	-
Yarn-Mistral-7b-128k	128k	3.08	2.68	2.47	2.24	2.19
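For readers unfamiliar with the metric, perplexity is conventionally the exponential of the average negative log-likelihood per token, so lower values mean the model predicts the text better. The sketch below shows that definition only. The log-probabilities are hypothetical, and the exact evaluation protocol behind the table (dataset, windowing) is not described on this page.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token log-probabilities, not taken from the model
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(perplexity(log_probs))
```

A model that assigned probability 0.5 to every token would score a perplexity of 2, which gives a feel for the values in the table.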

Short-context benchmarks (showing minimal quality degradation)

Model	Context window	ARC-c	Hellaswag	MMLU	Truthful QA
Mistral-7B-v0.1	8k	59.98	83.31	64.16	42.15
Yarn-Mistral-7b-64k	64k	59.38	81.21	61.32	42.50
Yarn-Mistral-7b-128k	128k	58.87	80.58	60.64	42.46
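To put a number on "minimal quality degradation", the short sketch below computes each extended model's score difference from the Mistral-7B-v0.1 baseline, using only the figures in the table above.

```python
benchmarks = ["ARC-c", "Hellaswag", "MMLU", "Truthful QA"]
scores = {
    "Mistral-7B-v0.1":      [59.98, 83.31, 64.16, 42.15],
    "Yarn-Mistral-7b-64k":  [59.38, 81.21, 61.32, 42.50],
    "Yarn-Mistral-7b-128k": [58.87, 80.58, 60.64, 42.46],
}

base = scores["Mistral-7B-v0.1"]
# Score difference from the baseline, in points, for each extended model
deltas = {
    model: [round(s - b, 2) for s, b in zip(values, base)]
    for model, values in scores.items()
    if model != "Mistral-7B-v0.1"
}

for model, diff in deltas.items():
    print(model, dict(zip(benchmarks, diff)))
```

For the 128k model the largest drop is on MMLU (3.52 points), while Truthful QA is slightly higher than the baseline.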

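Collaborators

bloc97: Methods, paper, and evaluations
@theemozilla: Methods, paper, model training, and evaluations
@EnricoShippole: Model training
honglu2875: Paper and evaluations

The authors would like to thank LAION AI for their support of compute for this model. It was trained on the [JUWELS](https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/juwels) supercomputer.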

📄 License

This project is licensed under apache-2.0.

Other Information

Discord

For further support, and for discussion of these models and AI in general, join TheBloke AI's Discord server.

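Thanks, and how to contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

Many people have asked whether they can contribute. I enjoy providing models and helping people, and would love to be able to spend more time doing it, as well as expanding into new projects such as fine-tuning/training.

If you are able and willing to contribute, it will be most gratefully received and will help me keep providing more models and start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Aemon Algiz.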

Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius