rank_zephyr_7b_v1_full-GGUF: an open-source text ranking model

Rank Zephyr 7b V1 Full GGUF

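GGUF version by MaziyarPanahi

A text ranking model based on Mistral-7B, available in several quantized versions for efficient inference.

Large language model | English | License: MIT | #text-ranking-optimization #multi-bit-quantization #long-sequence-processing

Downloads: 708

Released: 2/3/2024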

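Model overview

This is the GGUF-format version of castorini/rank_zephyr_7b_v1_full, a model built for text ranking. It is available at several quantization levels that trade accuracy against speed and memory use.

Model features

Multiple quantization options

Quantized versions from 2-bit to 8-bit are provided, so you can balance accuracy against performance as needed.

Efficient inference

The GGUF format is designed for efficient inference and runs on a wide range of hardware.

Text ranking

The model is tuned specifically for text ranking tasks.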

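Model capabilities

Text ranking

Efficient inference

Multiple quantization levels

Use cases

Information retrieval

Search result ranking

Reorder the results returned by a search engine to improve their relevance.

Recommender systems

Recommendation ranking

Rank the candidate items in a recommender system to improve the user experience.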

🚀 MaziyarPanahi/rank_zephyr_7b_v1_full - GGUF

This project provides GGUF-format files for the castorini/rank_zephyr_7b_v1_full model, for efficient text ranking.

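🚀 Quick start

Model information

Model creator: castorini
Original model: castorini/rank_zephyr_7b_v1_full

Model tags

Property	Details
Model type	Quantized model: 2-bit to 8-bit quantization, GGUF format, transformers architecture, safetensors storage
Training data	Not specified
Base model	mistralai/Mistral-7B-v0.1, castorini/rank_zephyr_7b_v1_full
License	MIT, Apache-2.0
Inference status	Inference not supported
Pipeline tag	Text ranking
Quantized by	MaziyarPanahi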

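✨ Key features

Multiple quantization levels: 2-bit to 8-bit quantization methods cover different performance and accuracy requirements.
GGUF format: the successor to GGML, supported by many clients and libraries.
Broad compatibility: works with llama.cpp, text-generation-webui, and other tools and libraries.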

📦 Installation

Install dependencies

To call the model from Python, install llama-cpp-python, choosing the variant that matches your hardware:

# Base llama-cpp-python with no GPU acceleration
pip install llama-cpp-python
# With NVIDIA CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBlast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration (macOS only)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

# On Windows, set CMAKE_ARGS in PowerShell using this format; e.g. for NVIDIA CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python
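
After installing, it can help to confirm that the package is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name installed_version is illustrative and not part of llama-cpp-python.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "llama-cpp-python") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("llama-cpp-python is not installed in this environment")
else:
    print(f"llama-cpp-python {version} is installed")
```

This only checks that the package is present; it does not tell you which acceleration backend the build was compiled with.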

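Download GGUF files

Note on manual downloads

Cloning the entire repository is not recommended. Download only the file for the quantization format you need.

Automatic download tools

LM Studio, LoLLMS Web UI, and Faraday.dev can download models for you.

Downloading in text-generation-webui

Under Download Model, enter the model repository [MaziyarPanahi/rank_zephyr_7b_v1_full-GGUF](https://huggingface.co/MaziyarPanahi/rank_zephyr_7b_v1_full-GGUF) and a specific filename (for example rank_zephyr_7b_v1_full-GGUF.Q4_K_M.gguf), then click Download.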

Downloading from the command line

Download a single file with the huggingface-hub library:

pip3 install huggingface-hub
huggingface-cli download MaziyarPanahi/rank_zephyr_7b_v1_full-GGUF rank_zephyr_7b_v1_full-GGUF.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

Download several files at once with a pattern:

huggingface-cli download MaziyarPanahi/rank_zephyr_7b_v1_full-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'

Speed up downloads with hf_transfer:

pip3 install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download MaziyarPanahi/rank_zephyr_7b_v1_full-GGUF rank_zephyr_7b_v1_full-GGUF.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

Windows command-line users can run set HF_HUB_ENABLE_HF_TRANSFER=1 before the download command.
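
The same single-file download can also be done from Python. This is a minimal sketch, assuming the huggingface_hub package is installed; hf_hub_download is that library's function for fetching one file from a repository, while gguf_filename and download_quant are illustrative helpers. The filename pattern is inferred from the Q4_K_M example above and has not been checked against the repository's file listing.

```python
REPO_ID = "MaziyarPanahi/rank_zephyr_7b_v1_full-GGUF"


def gguf_filename(quant: str) -> str:
    """Build a filename following the pattern of the Q4_K_M example (assumed)."""
    return f"rank_zephyr_7b_v1_full-GGUF.{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    """Download one quantized file and return its local path."""
    # Imported lazily so gguf_filename works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

Calling download_quant("Q4_K_M") would fetch that one file into the current directory; nothing is downloaded until the function is called.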

💻 Usage examples

Basic usage

Example llama.cpp command

Make sure you are using llama.cpp from commit d0cee0d or later:

./main -ngl 35 -m rank_zephyr_7b_v1_full-GGUF.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"

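Parameters:

-ngl: the number of layers to offload to the GPU; remove it if you have no GPU acceleration.
-c: the sequence length; longer sequences need more resources, so adjust as needed.
-p: the prompt; for an interactive chat-style conversation, replace the -p argument with -i -ins.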
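
The -p argument above is a template with two placeholders, {system_message} and {prompt}. As a small sketch of how it might be filled in before being passed to llama.cpp, the hypothetical helper build_prompt below substitutes both values. The template text is copied from the command above; whether it is the right template for this particular model should be checked separately.

```python
TEMPLATE = (
    "<|im_start|>system\n{system_message}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant"
)


def build_prompt(system_message: str, prompt: str) -> str:
    """Fill the two placeholders of the template shown in the command above."""
    return TEMPLATE.format(system_message=system_message, prompt=prompt)


print(build_prompt("You rank passages by relevance.", "Which passage answers the query?"))
```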

Python example

Load the model with the llama-cpp-python library:

from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
  model_path="./rank_zephyr_7b_v1_full-GGUF.Q4_K_M.gguf",  # Download the model file first
  n_ctx=32768,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=35         # The number of layers to offload to GPU, if you have GPU acceleration available
)

# Simple inference example
output = llm(
  "<|im_start|>system\n{system_message}<|im_end|>\n"
  "<|im_start|>user\n{prompt}<|im_end|>\n"
  "<|im_start|>assistant",  # Prompt template: fill in {system_message} and {prompt} before calling
  max_tokens=512,  # Generate up to 512 tokens
  stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
  echo=True        # Whether to echo the prompt
)

# Chat Completion API

llm = Llama(model_path="./rank_zephyr_7b_v1_full-GGUF.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
)
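
Because the model is intended for text ranking, the generated text usually has to be turned into an ordering of passages. The sketch below assumes that the completion is a permutation of bracketed identifiers such as [2] > [1] > [3], a format common to listwise rerankers; that output format is an assumption not stated on this page, and parse_ranking is a hypothetical helper, not part of llama-cpp-python.

```python
import re
from typing import List


def parse_ranking(text: str, num_passages: int) -> List[int]:
    """Turn text like '[2] > [1] > [3]' into 0-based passage indices.

    Duplicate or out-of-range identifiers are ignored, and any passage the
    model left out is appended in its original order.
    """
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", text):
        index = int(match) - 1
        if 0 <= index < num_passages and index not in order:
            order.append(index)
    missing = [i for i in range(num_passages) if i not in order]
    return order + missing


passages = ["about cats", "about llamas", "about dogs"]
ranking = parse_ranking("[2] > [3] > [1]", len(passages))
print([passages[i] for i in ranking])
```

Appending the missing passages keeps the result a full permutation even when the model's output is truncated or malformed.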

Advanced usage

LangChain integration

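📚 Documentation

About GGUF

GGUF is a new format introduced by the llama.cpp team on August 21, 2023. It replaces GGML, which is no longer supported. The following clients and libraries are known to support GGUF:

llama.cpp: the source project for GGUF, offering a CLI and a server option.
[text-generation-webui](https://github.com/oobabooga/text-generation-webui): a widely used web UI with many features and GPU acceleration.
KoboldCpp: a fully featured web UI with GPU acceleration across all platforms, well suited to story writing.
GPT4All: a free and open-source GUI that runs locally on Windows, Linux, and macOS with full GPU acceleration.
LM Studio: an easy-to-use and powerful local GUI for Windows and macOS (Silicon) with GPU acceleration; the Linux version is in beta.
[LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui): a web UI with many unique features, including a full model library for easy model selection.
Faraday.dev: an attractive, easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel) with GPU acceleration.
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python): a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
candle: a Rust ML framework focused on performance, with GPU support and ease of use.
ctransformers: a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible AI server. As of November 27, 2023, it has not been updated recently and does not support some newer models.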

About the quantization methods

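The new quantization methods are:

GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
GGML_TYPE_Q5_K - "type-1" 5-bit quantization with the same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
GGML_TYPE_Q6_K - "type-0" 6-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.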
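
The bits-per-weight figures can be reproduced with simple bookkeeping over one super-block of 256 weights. The sketch below adds up the quantized weights, the per-block scales (and mins for "type-1" formats), and one 16-bit super-block value per scale or min. That last term is an assumption about the storage layout that the list above does not state. Under this accounting the results match the quoted values for Q3_K through Q6_K, while Q2_K comes out at 2.625 rather than the quoted 2.5625.

```python
def bits_per_weight(weight_bits: int, blocks: int, block_size: int,
                    scale_bits: int, has_mins: bool) -> float:
    """Effective bits per weight for one super-block.

    Assumes one 16-bit super-block scale, plus one 16-bit super-block min
    for "type-1" formats (an assumption, not stated in the list above).
    """
    n_weights = blocks * block_size
    per_block_fields = 2 if has_mins else 1        # scale only, or scale + min
    total_bits = (
        n_weights * weight_bits                    # the quantized weights
        + blocks * scale_bits * per_block_fields   # per-block scales (and mins)
        + 16 * per_block_fields                    # 16-bit super-block value(s)
    )
    return total_bits / n_weights


# (weight bits, blocks per super-block, weights per block, scale bits, type-1?)
formats = {
    "Q2_K": (2, 16, 16, 4, True),
    "Q3_K": (3, 16, 16, 6, False),
    "Q4_K": (4, 8, 32, 6, True),
    "Q5_K": (5, 8, 32, 6, True),
    "Q6_K": (6, 16, 16, 8, False),
}
for name, params in formats.items():
    print(name, bits_per_weight(*params))
```

For example, Q4_K stores 256 x 4 = 1024 bits of weights, 8 x 6 x 2 = 96 bits of block scales and mins, and 32 bits of super-block values, giving 1152 / 256 = 4.5 bpw.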

Running in text-generation-webui

Further instructions can be found in the [text-generation-webui documentation](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20%E2%80%90%20Model%20Tab.md#llamacpp).