Swallow-7b-instruct-hf Open-Source Large Language Model - Llama 2 Optimized for Japanese Instruction Following

Swallow 7b Instruct Hf

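Developed by tokyotech-llm

A Japanese-enhanced large language model built on the Llama 2 family, with instruction following improved through supervised fine-tuning.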

Released: 12/7/2023

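Model Overview

Swallow is a Japanese-optimized large language model developed by the LLM team at Tokyo Institute of Technology. It strengthens the Japanese capabilities of Llama 2 through continual pre-training and instruction tuning, and supports both Japanese and English tasks.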

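Model Features

Japanese-optimized vocabulary

Adds Japanese-specific tokens to the vocabulary, so Japanese text is encoded with fewer tokens.

Bilingual support

Handles tasks in both Japanese and English.

Instruction tuning

Uses supervised fine-tuning (SFT) to improve instruction understanding and execution.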

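Model Capabilities

Japanese text generation

English text generation

Commonsense reasoning

Open-ended question answering

Reading comprehension

Summarization

Mathematical reasoning

Machine translation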

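Use Cases

Education

Japanese learning assistant

Helps students understand Japanese grammar and vocabulary.

The Swallow 7B base model reaches 48.08% accuracy on the JCommonsenseQA Japanese commonsense benchmark (4-shot).

Content creation

Japanese article generation

Generates coherent Japanese articles from a prompt.

The Swallow 7B base model scores 0.1830 on the XL-Sum summarization task (1-shot).

Translation

Japanese-English translation

Translates between Japanese and English in both directions.

The Swallow 7B base model scores a BLEU of 0.2510 on WMT20 English-to-Japanese translation (4-shot).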

🚀 Swallow

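The Swallow models are continually pre-trained from the Llama 2 family, primarily with additional Japanese data. The tuned versions use supervised fine-tuning (SFT). Links to the other models can be found in the index.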

🚀 Quick Start

First, install the additional dependencies listed in requirements.txt:

pip install -r requirements.txt

Use the instruct model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-7b-instruct-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto")


PROMPT_DICT = {
    "prompt_input": (
        "以下に、あるタスクを説明する指示があり、それに付随する入力が更なる文脈を提供しています。"
        "リクエストを適切に完了するための回答を記述してください。\n\n"
        "### 指示:\n{instruction}\n\n### 入力:\n{input}\n\n### 応答:"

    ),
    "prompt_no_input": (
        "以下に、あるタスクを説明する指示があります。"
        "リクエストを適切に完了するための回答を記述してください。\n\n"
        "### 指示:\n{instruction}\n\n### 応答:"
    ),
}

def create_prompt(instruction, input=None):
    """
    Generates a prompt based on the given instruction and an optional input.
    If input is provided, it uses the 'prompt_input' template from PROMPT_DICT.
    If no input is provided, it uses the 'prompt_no_input' template.

    Args:
        instruction (str): The instruction describing the task.
        input (str, optional): Additional input providing context for the task. Default is None.

    Returns:
        str: The generated prompt.
    """
    if input:
        # Use the 'prompt_input' template when additional input is provided
        return PROMPT_DICT["prompt_input"].format(instruction=instruction, input=input)
    else:
        # Use the 'prompt_no_input' template when no additional input is provided
        return PROMPT_DICT["prompt_no_input"].format(instruction=instruction)

# Example usage
instruction_example = "以下のトピックに関する詳細な情報を提供してください。"
input_example = "東京工業大学の主なキャンパスについて教えてください"
prompt = create_prompt(instruction_example, input_example)

input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
)

tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    temperature=0.99,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)
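
Because `generate` returns the prompt tokens followed by the newly generated ones, the decoded string `out` above starts with the full prompt. The following is a minimal sketch of one way to keep only the answer, assuming the prompt templates shown above; `extract_response` and `RESPONSE_MARKER` are hypothetical helper names, not part of the model card.

```python
# Hypothetical helper: keep only the text after the response marker
# used by the prompt templates in PROMPT_DICT.
RESPONSE_MARKER = "### 応答:"

def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the first marker, or the whole string if the marker is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()

# A hand-written stand-in for a decoded model output (not real model output).
decoded_example = (
    "以下に、あるタスクを説明する指示があります。"
    "リクエストを適切に完了するための回答を記述してください。\n\n"
    "### 指示:\n挨拶してください。\n\n### 応答:こんにちは。"
)
print(extract_response(decoded_example))
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, for example `tokenizer.decode(tokens[0][input_ids.shape[1]:], skip_special_tokens=True)`.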

Use the base model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "東京工業大学の主なキャンパスは、"
input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
)
tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    temperature=0.99,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)

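✨ Key Features

Multilingual support: supports Japanese and English.
Efficient inference: the tokenizer's vocabulary was expanded with Japanese data, so text is represented with fewer tokens and inference is faster.
Performance: Swallow outperforms the Llama 2 base model on most Japanese tasks in the benchmark tables, but scores lower than Llama 2 on the English tasks.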
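
The efficiency claim can be checked by measuring how many characters each token covers on average. The sketch below is a hedged illustration: `chars_per_token` is a hypothetical helper, and the two toy encoders (byte-level and character-level) only stand in for real tokenizers so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List

def chars_per_token(encode: Callable[[str], List], texts: Iterable[str]) -> float:
    """Average number of characters represented by one token (higher is more efficient)."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("the encoder produced no tokens")
    return total_chars / total_tokens

# Toy stand-ins for real tokenizers (assumptions for illustration only).
def byte_level_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))  # one token per UTF-8 byte

def char_level_encode(text: str) -> List[str]:
    return list(text)  # one token per character

sample = ["東京工業大学の主なキャンパスは、"]
print(chars_per_token(byte_level_encode, sample))
print(chars_per_token(char_level_encode, sample))
```

With real tokenizers, the encoder can be passed as `lambda t: tokenizer.encode(t, add_special_tokens=False)`, once for the Swallow tokenizer and once for the original Llama 2 tokenizer, on the same Japanese sample texts.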

📚 Detailed Documentation

Model Release Updates

April 26, 2024: Released version 0.1 of the enhanced instruction-tuned models as a preview: Swallow-7b-instruct-v0.1, Swallow-13b-instruct-v0.1, and Swallow-70b-instruct-v0.1.
March 2, 2024: Released Swallow-7b-plus-hf, a model trained with approximately twice as many Japanese tokens as Swallow-7b-hf.
February 4, 2024: Released Swallow-13b-NVE-hf.
January 26, 2024: Released Swallow-7b-NVE-hf, Swallow-7b-NVE-instruct-hf, Swallow-70b-NVE-hf, and Swallow-70b-NVE-instruct-hf.
December 19, 2023: Released Swallow-7b-hf, Swallow-7b-instruct-hf, Swallow-13b-hf, Swallow-13b-instruct-hf, Swallow-70b-hf, and Swallow-70b-instruct-hf.

Swallow Model Index

Model	Swallow-hf	Swallow-instruct-hf	Swallow-instruct-v0.1
7B	Link	Link	Link
7B-Plus	Link	N/A	N/A
13B	Link	Link	Link
70B	Link	Link	Link

Swallow Model Index NVE (No Vocabulary Expansion)

Model	Swallow-NVE-hf	Swallow-NVE-instruct-hf
7B	Link	Link
13B	Link	N/A
70B	Link	Link
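
The model names in the release notes follow a regular pattern, so the index tables can be expressed as a small lookup. This is a sketch only: `repo_id`, `ORG`, and `AVAILABLE` are hypothetical names, and it assumes every model lives under the same `tokyotech-llm` organization as the two IDs used in the quick start.

```python
# Assumption: all Swallow models share the organization used in the quick-start code.
ORG = "tokyotech-llm"

# Availability transcribed from the two index tables (variant -> listed sizes).
AVAILABLE = {
    "hf": {"7b", "7b-plus", "13b", "70b"},
    "instruct-hf": {"7b", "13b", "70b"},
    "instruct-v0.1": {"7b", "13b", "70b"},
    "NVE-hf": {"7b", "13b", "70b"},
    "NVE-instruct-hf": {"7b", "70b"},
}

def repo_id(size: str, variant: str = "hf") -> str:
    """Build a repository ID such as 'tokyotech-llm/Swallow-7b-instruct-hf'."""
    sizes = AVAILABLE.get(variant)
    if sizes is None:
        raise ValueError(f"unknown variant: {variant!r}")
    if size not in sizes:
        raise ValueError(f"no {variant!r} release listed for size {size!r}")
    return f"{ORG}/Swallow-{size}-{variant}"

print(repo_id("7b", "instruct-hf"))
print(repo_id("7b-plus"))
```

Requesting a combination the tables mark as unavailable, such as `repo_id("13b", "NVE-instruct-hf")`, raises a `ValueError` instead of returning a name that does not exist.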

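Model Details

Attribute	Details
Model type	Please refer to the LLaMA-2 technical report for details on the model architecture.
Language	Japanese, English
Library	Megatron-LM
Tokenizer	This model uses a tokenizer whose vocabulary was expanded with Japanese data. It represents text with fewer tokens, which leads to noticeably faster inference.
Contact	swallow[at]nlp.c.titech.ac.jp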

Base Model Performance

Japanese Tasks

Model	Size	JCommonsenseQA (4-shot)	JEMHopQA (4-shot)	NIILC (4-shot)	JSQuAD (4-shot)	XL-Sum (1-shot)	MGSM (4-shot)	WMT20-en-ja (4-shot)	WMT20-ja-en (4-shot)
Llama 2	7B	0.3852	0.4240	0.3410	0.7917	0.1905	0.0760	0.1783	0.1738
Swallow	7B	0.4808	0.5078	0.5968	0.8573	0.1830	0.1240	0.2510	0.1511
Swallow-Plus	7B	0.5478	0.5493	0.6030	0.8544	0.1806	0.1360	0.2568	0.1441
Swallow-NVE	7B	0.5433	0.5425	0.5729	0.8684	0.2117	0.1200	0.2405	0.1512
Llama 2	13B	0.6997	0.4415	0.4170	0.8533	0.2139	0.1320	0.2146	0.1982
Swallow	13B	0.7837	0.5063	0.6398	0.9005	0.2168	0.2040	0.2720	0.1771
Swallow-NVE	13B	0.7712	0.5438	0.6351	0.9030	0.2294	0.2120	0.2735	0.1817
Llama 2	70B	0.8686	0.4656	0.5256	0.9080	0.2361	0.3560	0.2643	0.2398
Swallow	70B	0.9348	0.6290	0.6960	0.9176	0.2266	0.4840	0.3043	0.2298
Swallow-NVE	70B	0.9410	0.5759	0.7024	0.9254	0.2758	0.4720	0.3042	0.2322
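
To make the Japanese-task comparison concrete, the sketch below computes the per-task difference between Swallow 7B and Llama 2 7B using the two 7B rows transcribed from the table above. `score_deltas` is a hypothetical helper name.

```python
# Scores copied from the 7B rows of the Japanese task table.
LLAMA2_7B = {
    "JCommonsenseQA": 0.3852, "JEMHopQA": 0.4240, "NIILC": 0.3410, "JSQuAD": 0.7917,
    "XL-Sum": 0.1905, "MGSM": 0.0760, "WMT20-en-ja": 0.1783, "WMT20-ja-en": 0.1738,
}
SWALLOW_7B = {
    "JCommonsenseQA": 0.4808, "JEMHopQA": 0.5078, "NIILC": 0.5968, "JSQuAD": 0.8573,
    "XL-Sum": 0.1830, "MGSM": 0.1240, "WMT20-en-ja": 0.2510, "WMT20-ja-en": 0.1511,
}

def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-task difference (candidate minus baseline), rounded to four decimals."""
    return {task: round(candidate[task] - baseline[task], 4) for task in baseline}

deltas = score_deltas(LLAMA2_7B, SWALLOW_7B)
for task, delta in deltas.items():
    print(f"{task:<15} {delta:+.4f}")

# Tasks where Swallow 7B scores higher than Llama 2 7B.
improved = [task for task, delta in deltas.items() if delta > 0]
print(f"{len(improved)} of {len(deltas)} tasks improved")
```

At 7B, the differences are positive on six of the eight tasks; XL-Sum and WMT20-ja-en are the two where Swallow scores lower.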

English Tasks

Model	Size	OpenBookQA (8-shot)	TriviaQA (8-shot)	HellaSwag (8-shot)	SQuAD2.0 (8-shot)	XWINO (8-shot)	GSM8K (8-shot)
Llama 2	7B	0.3580	0.6265	0.5860	0.3207	0.9049	0.1410
Swallow	7B	0.3180	0.4836	0.5308	0.3125	0.8817	0.1130
Swallow-Plus	7B	0.3280	0.4558	0.5259	0.3134	0.8929	0.1061
Swallow-NVE	7B	0.3180	0.5079	0.5329	0.2919	0.8817	0.0986
Llama 2	13B	0.3760	0.7255	0.6148	0.3681	0.9140	0.2403
Swallow	13B	0.3500	0.5852	0.5660	0.3406	0.9075	0.2039
Swallow-NVE	13B	0.3460	0.6025	0.5700	0.3478	0.9006	0.1751
Llama 2	70B	0.4280	0.8239	0.6742	0.3770	0.9290	0.5284
Swallow	70B	0.4220	0.7756	0.6458	0.3745	0.9204	0.4867
Swallow-NVE	70B	0.4240	0.7817	0.6439	0.3451	0.9256	0.4943

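Evaluation Benchmarks

Japanese Evaluation Benchmarks

We used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41). The details are as follows:

Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
Open-ended question answering (JEMHopQA [Ishii+, 2023])
Open-ended question answering (NIILC [Sekine, 2003])
Machine reading comprehension (JSQuAD [Kurihara+, 2022])
Automatic summarization (XL-Sum [Hasan+, 2021])
Machine translation (WMT2020 ja-en [Barrault+, 2020])
Machine translation (WMT2020 en-ja [Barrault+, 2020])
Mathematical reasoning (MGSM [Shi+, 2023])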
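
The Japanese scores are reported in few-shot settings (4-shot for most tasks, 1-shot for XL-Sum), meaning a handful of solved examples precede the actual question in the prompt. The sketch below shows the general idea only. The `質問:`/`回答:` layout and the `build_few_shot_prompt` helper are assumptions for illustration, not the prompt format used by llm-jp-eval or the evaluation harness.

```python
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str, k: int) -> str:
    """Concatenate k solved (question, answer) pairs, then the unanswered query."""
    if k > len(examples):
        raise ValueError(f"need at least {k} examples, got {len(examples)}")
    blocks = [f"質問: {question}\n回答: {answer}" for question, answer in examples[:k]]
    blocks.append(f"質問: {query}\n回答:")  # the model is expected to continue from here
    return "\n\n".join(blocks)

# Hand-written demonstration pairs (not taken from any benchmark).
demo_examples = [
    ("日本の首都はどこですか？", "東京"),
    ("1年は何か月ですか？", "12か月"),
    ("水の化学式は何ですか？", "H2O"),
    ("富士山がある国はどこですか？", "日本"),
]
four_shot_prompt = build_few_shot_prompt(demo_examples, "一週間は何日ですか？", k=4)
print(four_shot_prompt)
```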

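English Evaluation Benchmarks

We used the Language Model Evaluation Harness (v.0.3.0). The details are as follows:

Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
Open-ended question answering (TriviaQA [Joshi+, 2017])
Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
Commonsense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
Natural language inference (HellaSwag [Zellers+, 2019])
Mathematical reasoning (GSM8k [Cobbe+, 2021])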

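Training Datasets

Continual Pre-Training

The following datasets were used for continual pre-training:

Instruction Tuning

The following datasets were used for instruction tuning: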

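🔧 Technical Details

The Swallow models are continually pre-trained from Llama 2 with additional Japanese data, and the tuned versions are further trained with supervised fine-tuning (SFT). The tokenizer's vocabulary is expanded with Japanese data, which improves text representation efficiency and inference speed.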
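
This page does not describe how the SFT data was fed to the model. As a hedged illustration of a common convention (not a confirmed part of the Swallow recipe), supervised fine-tuning often computes the loss only on response tokens by masking the prompt positions in the labels. `build_sft_example` and the token IDs below are hypothetical.

```python
from typing import List, Tuple

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss,
# so positions labeled with it contribute nothing to the loss.
IGNORE_INDEX = -100

def build_sft_example(
    prompt_ids: List[int], response_ids: List[int], eos_id: int
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with prompt positions masked out of the loss."""
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return input_ids, labels

# Made-up token IDs purely for illustration.
example_inputs, example_labels = build_sft_example([11, 12, 13], [21, 22], eos_id=2)
print(example_inputs)
print(example_labels)
```

The two lists have the same length, so they can be passed together as `input_ids` and `labels` to a causal language model that shifts the labels internally.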

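📄 License

Acknowledgements

We thank Meta Research for releasing Llama 2 under an open license so that others can build on it.

This project is supported by the ABCI Large-scale Language Model Building Support Program of the National Institute of Advanced Industrial Science and Technology.

Authors

Okazaki Laboratory

YOKOTA Laboratory

How to Cite

If you find our work helpful, please cite: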

@inproceedings{Fujii:COLM2024,
   title={Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities},
   author={Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Hiroki Iida and Masanari Ohi and Kakeru Hattori and Hirai Shota and Sakae Mizuki and Rio Yokota and Naoaki Okazaki},
   booktitle={Proceedings of the First Conference on Language Modeling},
   series={COLM},
   pages={(to appear)},
   year={2024},
   month=oct,
   address={University of Pennsylvania, USA},
}

@inproceedings{Okazaki:COLM2024,
   title={Building a Large Japanese Web Corpus for Large Language Models},
   author={Naoaki Okazaki and Kakeru Hattori and Hirai Shota and Hiroki Iida and Masanari Ohi and Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Rio Yokota and Sakae Mizuki},
   booktitle={Proceedings of the First Conference on Language Modeling},
   series={COLM},
   pages={(to appear)},
   year={2024},
   month=oct,
   address={University of Pennsylvania, USA},
}