AI21-Jamba-Mini-1.5 open model: a practical tool for efficient long-text processing and fast inference

AI21 Jamba Mini 1.5

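Developed by ai21labs

AI21 Jamba 1.5 Mini is a state-of-the-art hybrid SSM-Transformer instruction-following foundation model with efficient long-context handling and fast inference.

Large language model

Transformers

Supports multiple languages. License: Other. Tags: #256K long context, #hybrid SSM-Transformer architecture, #multilingual text generation

Downloads: 6,102

Released: 8/19/2024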

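Model overview

Jamba 1.5 Mini is one of the most powerful and efficient long-context models on the market, with inference up to 2.5x faster than leading models of comparable size. It demonstrates strong long-context handling, speed, and quality, and marks the first time a non-Transformer model has been successfully scaled to the quality and strength of market-leading models.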

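Model features

Efficient long-context handling

Supports context lengths of up to 256K tokens, enough for very long text inputs.

Fast inference

Inference is up to 2.5x faster than leading models of comparable size.

Hybrid SSM-Transformer architecture

Combines the strengths of state space models (SSM) and Transformers for efficient, high-quality performance.

Multilingual support

Supports English, French, German, Dutch, Spanish, Portuguese, Italian, Arabic, and Hebrew.

Optimized for business use cases

Optimized for business use cases such as function calling, structured output (JSON), and grounded generation.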

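Model capabilities

Text generation

Long-context processing

Multilingual text generation

Function calling

Structured output (JSON)

Grounded generation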

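Use cases

Business applications

Function calling

Calls external functions in response to user requests to automate tasks.

Efficient and accurate function calling.

Structured output

Generates structured JSON output that programs can consume directly.

Output is well-formed and easy to parse.

Multilingual applications

Multilingual text generation

Supports text generation tasks in multiple languages.

High-quality multilingual text output.

Long-text processing

Long-document summarization

Processes documents of up to 256K tokens and generates summaries.

Efficient and accurate summarization.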

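🚀 AI21 Jamba 1.5 models

AI21 Jamba 1.5 is a family of state-of-the-art foundation models with efficient long-context handling and strong performance. The models perform well across multiple languages and tasks, and suit business scenarios such as function calling and structured output.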

🚀 Quick start

Please note that this version will be deprecated on May 6, 2025. We recommend transitioning to the newer version.

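✨ Key features

Advanced architecture: the AI21 Jamba 1.5 models are state-of-the-art hybrid SSM-Transformer instruction-following foundation models.
Efficient inference: among the most powerful and efficient long-context models on the market, with inference up to 2.5x faster than leading models of comparable size.
Multilingual support: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew.
Business optimization: optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded (document-based) generation.
Flexible licensing: released under the Jamba Open Model License, which permits full research and commercial use under the license terms.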

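📦 Installation

Running the optimized Mamba implementation

To run the optimized Mamba implementation, first install mamba-ssm and causal-conv1d:

pip install mamba-ssm "causal-conv1d>=1.2.0"

The model must also be on a CUDA device.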

Installing vLLM

For efficient inference with vLLM, install vLLM version 0.5.4 or later:

pip install "vllm>=0.5.4"

Using ExpertsInt8 quantization

ExpertsInt8 quantization requires vLLM version 0.5.5 or later:

pip install "vllm>=0.5.5"

Installing `transformers`

When using the transformers library, avoid versions 4.44.0 and 4.44.1, which have a bug that restricts the ability to run the Jamba architecture.
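
The version constraint above can be checked programmatically before loading the model. The sketch below is an illustration only: the helper names `is_known_bad_transformers` and `installed_transformers_version` are hypothetical, and the installed version is read with the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Releases the text above flags as unable to run the Jamba architecture.
KNOWN_BAD_TRANSFORMERS = {"4.44.0", "4.44.1"}


def is_known_bad_transformers(version: str) -> bool:
    """Return True if `version` is one of the releases flagged above."""
    return version.strip() in KNOWN_BAD_TRANSFORMERS


def installed_transformers_version():
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    elif is_known_bad_transformers(found):
        print(f"transformers {found} cannot run Jamba; install a different release")
    else:
        print(f"transformers {found} is not one of the flagged releases")
```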

Installing `trl` for fine-tuning

Fine-tuning with SFTTrainer requires trl:

pip install trl

Installing `bitsandbytes` for 4-bit quantization

Fine-tuning with QLoRA requires bitsandbytes:

pip install bitsandbytes

💻 Usage examples

Running the model with vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-1.5-Mini"
number_gpus = 2

llm = LLM(model=model,
          max_model_len=200*1024,
          tensor_parallel_size=number_gpus)

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100) 
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
#Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?
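
The `max_model_len=200*1024` setting above caps the combined length of the prompt and the generated tokens. The following sketch works through that arithmetic. `fits_in_context` is a hypothetical helper, not part of vLLM; the prompt lengths in the example are made up, and treating 256K as 256 × 1024 tokens is an assumption.

```python
MODEL_CONTEXT = 256 * 1024      # 256K context length stated for Jamba 1.5 (assumed to mean 256 * 1024)
MAX_MODEL_LEN = 200 * 1024      # value passed to vLLM in the example above


def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    max_model_len: int = MAX_MODEL_LEN) -> bool:
    """Check that prompt plus requested output stays within max_model_len."""
    return prompt_tokens + max_new_tokens <= max_model_len


print(MAX_MODEL_LEN)                      # 204800
print(MODEL_CONTEXT - MAX_MODEL_LEN)      # 57344 tokens of the full context left unused
print(fits_in_context(204_700, 100))      # True: exactly 204800
print(fits_in_context(204_701, 100))      # False: one token over
```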

Running the model with ExpertsInt8 quantization

import os
os.environ['VLLM_FUSED_MOE_CHUNK_SIZE']='32768'    # This is a workaround for a bug in vLLM's fused_moe kernel

from vllm import LLM
llm = LLM(model="ai21labs/AI21-Jamba-1.5-Mini",
          max_model_len=100*1024,
          quantization="experts_int8")

Running the model with `transformers`

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)

outputs = model.generate(input_ids, max_new_tokens=216)

# Decode the output
conversation = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the conversation to get only the assistant's response
assistant_response = conversation.split(messages[-1]['content'])[1].strip()
print(assistant_response)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?
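
Splitting the decoded text on the last user message works for this example, but it would misfire if the user's text also appeared in the reply. A common alternative is to drop the prompt tokens before decoding. The sketch below illustrates the idea with plain Python lists standing in for token-ID tensors; `strip_prompt` is a hypothetical helper and the token IDs are made up.

```python
def strip_prompt(output_ids, prompt_len):
    """Return only the newly generated token IDs.

    Decoder-only generation returns the prompt followed by the new tokens,
    so everything from index `prompt_len` onward is the reply.
    """
    return output_ids[prompt_len:]


prompt_ids = [101, 7592, 2088]            # made-up IDs for the prompt
output_ids = prompt_ids + [2023, 2003]    # made-up IDs: prompt followed by reply
reply_ids = strip_prompt(output_ids, len(prompt_ids))
print(reply_ids)  # [2023, 2003]
```

With the transformers example above, the equivalent slice would be `outputs[0][input_ids.shape[-1]:]`, which can then be passed to `tokenizer.decode(..., skip_special_tokens=True)`.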

Loading the model in 8-bit precision

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config)
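
To see why lower-precision loading matters, a back-of-the-envelope estimate of weight memory helps. The sketch below uses the 52B total parameter count listed for Jamba 1.5 Mini in the RULER table later in this page. It counts weights only (no activations or cache), and it treats every weight as 8-bit even though the example above keeps the Mamba modules out of 8-bit quantization, so the 8-bit figure is a lower bound. `weight_memory_gib` is a hypothetical helper.

```python
TOTAL_PARAMS = 52e9   # total parameters of Jamba 1.5 Mini (12B active / 52B total)
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


bf16 = weight_memory_gib(TOTAL_PARAMS, 2)   # bfloat16: 2 bytes per weight
int8 = weight_memory_gib(TOTAL_PARAMS, 1)   # int8: 1 byte per weight
print(f"bf16: {bf16:.1f} GiB, int8: {int8:.1f} GiB")
# bf16: 96.9 GiB, int8: 48.4 GiB
```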

Loading the model on CPU

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             use_mamba_kernels=False)
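
The choice between the optimized kernels and the CPU fallback can be made at load time. The sketch below is one way to do that, assuming the caller passes in whether CUDA is available. `kernels_installed` and `jamba_load_kwargs` are hypothetical helpers, and they only build keyword arguments rather than loading the model.

```python
from importlib.util import find_spec


def kernels_installed() -> bool:
    """True if both packages needed for the optimized Mamba path are importable."""
    return find_spec("mamba_ssm") is not None and find_spec("causal_conv1d") is not None


def jamba_load_kwargs(has_cuda: bool, has_kernels: bool) -> dict:
    """Build keyword arguments for AutoModelForCausalLM.from_pretrained."""
    if has_cuda and has_kernels:
        # Optimized path: kernels are installed and a CUDA device is present.
        return {"device_map": "auto"}
    # Fallback path from the example above: disable the Mamba kernels.
    return {"use_mamba_kernels": False}


print(jamba_load_kwargs(has_cuda=False, has_kernels=kernels_installed()))
# {'use_mamba_kernels': False}
```

In practice `has_cuda` would typically come from `torch.cuda.is_available()`.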

Tool use example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
    {
        "role": "user", 
        "content": "What's the weather like right now in Jerusalem and in London?"
    }
]

tools = [
    {
        'type': 'function', 
        'function': {
            'name': 'get_current_weather', 
            'description': 'Get the current weather', 
            'parameters': {
                'type': 'object', 
                'properties': {
                    'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 
                    'format': {'type': 'string', 'enum': ['celsius', 'fahrenheit'], 'description': 'The temperature unit to use. Infer this from the users location.'}
                }, 
                'required': ['location', 'format']
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)
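
Before executing a function the model asks for, it is usually worth checking the arguments against the schema declared in `tools`. The sketch below handles only the two constraints used above (`required` and `enum`); `check_arguments` is a hypothetical helper, not a full JSON Schema validator.

```python
def check_arguments(parameters: dict, arguments: dict) -> list:
    """Return a list of problems found; an empty list means the call looks valid."""
    problems = []
    for name in parameters.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = parameters.get("properties", {}).get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


# Same parameter schema as get_current_weather above, descriptions omitted.
weather_parameters = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location", "format"],
}

print(check_arguments(weather_parameters, {"location": "London", "format": "celsius"}))
# []
print(check_arguments(weather_parameters, {"location": "London"}))
# ['missing required argument: format']
```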

Feeding tool responses back to the model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

# `tools` is the list defined in the tool use example above.
# Note that you must send the tool responses in the same order as the model called the tools:
messages = [
    {
        "role": "user",
        "content": "What's the weather like right now in Jerusalem and in London?"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"
            },
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"
            }
        ]
    },
    {
        "role": "tool",
        "content": "The weather in Jerusalem is 18 degrees celsius."
    },
    {
        "role": "tool",
        "content": "The weather in London is 8 degrees celsius."
    }
]

tool_use_prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)
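
The message list above can also be assembled programmatically. The sketch below runs each requested call against a local registry and appends one `tool` message per call, in the order the model issued them. The `get_current_weather` stub, its canned data, and the `REGISTRY` mapping are hypothetical stand-ins for real implementations.

```python
import json


def get_current_weather(location: str, format: str) -> str:
    """Stub that returns canned data; a real version would query a weather service."""
    canned = {"Jerusalem": 18, "London": 8}
    return f"The weather in {location} is {canned[location]} degrees {format}."


# Hypothetical registry mapping tool names to local callables.
REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls: list) -> list:
    """Execute calls in order and return matching 'tool' role messages."""
    replies = []
    for call in tool_calls:
        arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
        result = REGISTRY[call["name"]](**arguments)
        replies.append({"role": "tool", "content": result})
    return replies


tool_calls = [
    {"name": "get_current_weather",
     "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"},
    {"name": "get_current_weather",
     "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"},
]
for message in run_tool_calls(tool_calls):
    print(message["content"])
# The weather in Jerusalem is 18 degrees celsius.
# The weather in London is 8 degrees celsius.
```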

Attaching documents to a Jamba 1.5 prompt

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
        {
            "role": "user",
            "content": "Who wrote Harry Potter?"
        }
]

documents = [
        {
            "text": "Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling.",
            "title": "Harry Potter"
        },
        {
            "text": "The Great Gatsby is a novel by American writer F. Scott Fitzgerald.",
            "title": "The Great Gatsby",
            "country": "United States",
            "genre": "Novel"

        }
]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tokenize=False,
)

# Output: J. K. Rowling
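
Each entry in `documents` above is a dictionary with a `text` field, a `title`, and optionally extra metadata keys, as in the second entry. A small builder can keep entries consistent. `make_document` is a hypothetical helper, and treating `text` as mandatory is an assumption based on the example rather than a documented rule.

```python
def make_document(text: str, title: str = "", **metadata: str) -> dict:
    """Build one document entry in the shape used by the example above."""
    if not text.strip():
        raise ValueError("a document needs non-empty text")
    doc = {"text": text}
    if title:
        doc["title"] = title
    doc.update(metadata)  # extra keys such as country or genre
    return doc


documents = [
    make_document(
        "Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling.",
        title="Harry Potter",
    ),
    make_document(
        "The Great Gatsby is a novel by American writer F. Scott Fitzgerald.",
        title="The Great Gatsby",
        country="United States",
        genre="Novel",
    ),
]
print(sorted(documents[1]))  # ['country', 'genre', 'text', 'title']
```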

Using JSON mode

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")
messages = [
    {'role':'user', 
     'content':'Describe the first American president. Include year of birth (number) and name (string).'}
    ]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=False,
                                       knobs={"response_format": "json_object", "is_set": True})

#Output: "{ "year of birth": 1732, "name": "George Washington." }"
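
Output produced in JSON mode still needs to be parsed and checked by the caller. The sketch below parses the sample reply shown above and verifies the field types requested in the prompt; `parse_president` is a hypothetical helper.

```python
import json


def parse_president(raw: str) -> dict:
    """Parse the reply and check the types the prompt asked for."""
    data = json.loads(raw)
    if not isinstance(data.get("year of birth"), int):
        raise ValueError("'year of birth' should be a number")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' should be a string")
    return data


sample = '{ "year of birth": 1732, "name": "George Washington." }'
result = parse_president(sample)
print(result["year of birth"], result["name"])  # 1732 George Washington.
```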

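📚 Detailed documentation

Model details

Attribute	Details
Developer	AI21
Model type	Joint Attention and Mamba (Jamba)
License	Jamba Open Model License
Context length	256K
Knowledge cutoff date	March 5, 2024
Supported languages	English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew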

Common benchmark results

Benchmark	Jamba 1.5 Mini	Jamba 1.5 Large
Arena Hard	46.1	65.4
Wild Bench	42.4	48.5
MMLU (CoT)	69.7	81.2
MMLU Pro (CoT)	42.5	53.5
GPQA	32.3	36.9
ARC Challenge	85.7	93
BFCL	80.6	85.5
GSM-8K	75.8	87
RealToxicity (lower is better)	8.1	6.7
TruthfulQA	54.1	58.3

RULER benchmark: effective context length

Model	Claimed length	Effective length	4K	8K	16K	32K	64K	128K	256K
Jamba 1.5 Large (94B/398B)	256K	256K	96.7	96.6	96.4	96.0	95.4	95.1	93.9
Jamba 1.5 Mini (12B/52B)	256K	256K	95.7	95.2	94.7	93.8	92.7	89.8	86.1
Gemini 1.5 Pro	1M	>128K	96.7	95.8	96.0	95.9	95.9	94.4	--
GPT-4 1106-preview	128K	64K	96.6	96.3	95.2	93.2	87.0	81.2	--
Llama 3.1 70B	128K	64K	96.5	95.8	95.4	94.8	88.4	66.6	--
Command R-plus (104B)	128K	32K	95.6	95.2	94.2	92.0	84.3	63.1	--
Llama 3.1 8B	128K	32K	95.5	93.8	91.6	87.4	84.7	77.0	--
Mistral Large 2 (123B)	128K	32K	96.2	96.1	95.1	93.0	78.8	23.7	--
Mixtral 8x22B (39B/141B)	64K	32K	95.6	94.9	93.4	90.9	84.7	31.7	--
Mixtral 8x7B (12.9B/46.7B)	32K	32K	94.9	92.1	92.5	85.9	72.4	44.5	--
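
The effective-length column can be reproduced from the per-length scores. The sketch below assumes the convention used by the RULER benchmark, where a length counts as effective if its score meets a fixed pass mark. The value 85.6 is an assumption taken from that convention and is not stated in the table; `effective_length` is a hypothetical helper.

```python
LENGTHS = ["4K", "8K", "16K", "32K", "64K", "128K", "256K"]
THRESHOLD = 85.6  # assumed pass mark; not given in the table above


def effective_length(scores):
    """Return the longest length reached before the first score below THRESHOLD."""
    best = None
    for length, score in zip(LENGTHS, scores):
        if score is None or score < THRESHOLD:
            break
        best = length
    return best


# Scores copied from the table above; None marks lengths that were not reported.
rows = {
    "Jamba 1.5 Mini": [95.7, 95.2, 94.7, 93.8, 92.7, 89.8, 86.1],
    "Llama 3.1 70B": [96.5, 95.8, 95.4, 94.8, 88.4, 66.6, None],
    "Mistral Large 2": [96.2, 96.1, 95.1, 93.0, 78.8, 23.7, None],
}
for model, scores in rows.items():
    print(model, effective_length(scores))
# Jamba 1.5 Mini 256K
# Llama 3.1 70B 64K
# Mistral Large 2 32K
```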