Colossal-LLaMA-2-7b-base: An Open-Source Chinese-English Bilingual LLM with Long-Context Support


Colossal LLaMA 2 7b Base

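Developed by hpcai-tech

An open-source Chinese-English bilingual large language model based on LLaMA-2, continually pre-trained on about 8.5 billion tokens, with a 4096-token context window.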

Large language model

Transformers

Multilingual #Chinese-English bilingual #Low-cost pre-training #Large context window

Downloads: 147

Release date: 9/18/2023

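Model Overview

Colossal-LLaMA-2-7B-base is an open-source Chinese-English bilingual large language model based on LLaMA-2. Continual pre-training strengthens its Chinese ability while preserving its English ability, making it suitable for a range of natural language processing tasks.

Model Features

Low-cost, efficient training

Continual pre-training on about 8.5 billion tokens finished in 15 hours on 64 A800 GPUs, at a cost of under 1,000 USD.

Chinese-English bilingual support

Strengthens LLaMA-2's Chinese ability while preserving its English ability, so the model can handle tasks in both languages.

Long context window

Supports a 4096-token context window, suitable for long-text tasks.

Open source, no commercial restrictions

Released under the LLaMA-2 license and the Apache 2.0 license, with no additional restrictions on commercial use.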

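Model Capabilities

Text generation

Natural language understanding

Chinese-English bilingual processing

Long-text processing

Use Cases

General natural language processing

Text completion

Generates a coherent continuation of a given text prompt.

Produces fluent, coherent text

Question answering

Answers users' questions and provides relevant information.

Answers a wide range of questions accurately

Education

Language learning aid

Helps learners practice Chinese-English bilingual writing and reading comprehension.

Provides high-quality language learning support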

🚀 Colossal-LLaMA-2-7B

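Colossal-LLaMA-2-7B is an open-source model built on LLaMA-2. After continual pre-training, it scores well on Chinese and English benchmarks at low cost, and it can serve as a base for building domain-specific or task-specific models.

🚀 Quick Start

Load the model

Use the following code to load the Colossal-LLaMA-2-7B-base model and generate text: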

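from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", trust_remote_code=True)

# Prompt: the opening line of a classical Chinese poem for the model to continue.
# Named `prompt` rather than `input` to avoid shadowing the Python builtin.
prompt = "明月松间照，\n\n->\n\n"
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Sample one continuation with a low temperature.
pred = model.generate(**inputs,
                      max_new_tokens=512,
                      do_sample=True,
                      temperature=0.3,
                      top_k=50,
                      top_p=0.95,
                      num_return_sequences=1)

# Decode the output and strip the prompt from the printed text.
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(prompt):])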

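✨ Key Features

Open source: Colossal-LLaMA-2-7B-base is an open-source model built on LLaMA-2.
Cost-effective: continual pre-training on about 8.5 billion tokens took only 15 hours on 64 A800 GPUs and cost under 1,000 USD, yet it achieves results similar to models that cost millions of dollars to pre-train from scratch.
Multilingual support: supports Chinese and English, with a 4096-token context window.
Strong performance: performs well on standard Chinese and English benchmarks such as C-Eval and MMLU.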

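📚 Documentation

Model Introduction

The Colossal-AI team has released the open-source model Colossal-LLaMA-2-7B-base. The model is based on LLaMA-2 and was continually pre-trained on about 8.5 billion tokens over 15 hours on 64 A800 GPUs. At a cost of under 1,000 USD, it achieves results similar to models that cost millions of dollars to pre-train from scratch. It is released under the LLaMA-2 license and the Apache 2.0 license with no additional restrictions on commercial use, and it can be used to build domain-specific or task-specific models.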
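The budget figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text (64 GPUs, 15 hours, about 8.5 billion tokens, a cost ceiling of 1,000 USD); the derived price and throughput are implied bounds, not values reported by the authors.

```python
# Back-of-the-envelope check of the training budget quoted in the text.
NUM_GPUS = 64        # A800 GPUs (from the text)
HOURS = 15           # wall-clock training time (from the text)
TOKENS = 8.5e9       # approximate tokens consumed (from the text)
BUDGET_USD = 1000    # stated upper bound on cost (from the text)

gpu_hours = NUM_GPUS * HOURS
# Highest hourly GPU price that still keeps the run under the budget.
max_price_per_gpu_hour = BUDGET_USD / gpu_hours
# Average throughput implied by the token count and the total GPU time.
tokens_per_gpu_second = TOKENS / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours}")
print(f"Implied price ceiling: ${max_price_per_gpu_hour:.2f} per GPU-hour")
print(f"Average throughput: {tokens_per_gpu_second:.0f} tokens/s per GPU")
```

In other words, the run fits in under a thousand GPU-hours, so the cost claim holds as long as an A800 can be rented for roughly one dollar per hour.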

Colossal-LLaMA-2-7B-base supports Chinese and English, has a 4096-token context window, and performs well on standard Chinese and English benchmarks such as C-Eval and MMLU.

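Performance Evaluation

We ran a comprehensive evaluation on 4 datasets and compared the Colossal-Llama-2-7b-base model with a range of other models.

MMLU and CMMLU use 5-shot; scores are computed from the logits of the first predicted token.
AGIEval uses 5-shot; only 4-choice questions are scored, using a combined metric of exact match and the logits of the first predicted token.
GAOKAO-Bench uses 0-shot; only 4-choice questions are scored, based on the logits of the first predicted token.
The generation config for all datasets is greedy search.
We also report CEval scores, taken from its latest leaderboard or from the models' official repositories.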
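As an illustration of scoring by first-token logits, the sketch below picks the option label with the highest logit and computes accuracy over a set of questions. It is a minimal reading of the description above, not the ColossalEval implementation; the function names `predict_choice` and `accuracy` and the logit values are made up for the example.

```python
from typing import Dict, List


def predict_choice(first_token_logits: Dict[str, float]) -> str:
    """Return the option label whose first-token logit is highest."""
    return max(first_token_logits, key=first_token_logits.get)


def accuracy(all_logits: List[Dict[str, float]], answers: List[str]) -> float:
    """Fraction of questions where the top-logit label equals the gold answer."""
    correct = sum(predict_choice(logits) == gold
                  for logits, gold in zip(all_logits, answers))
    return correct / len(answers)


# Hypothetical logits for two four-option questions (illustrative numbers only).
example_logits = [
    {"A": 1.2, "B": 3.4, "C": 0.1, "D": -0.5},
    {"A": 2.0, "B": 0.3, "C": 2.5, "D": 1.1},
]
example_answers = ["B", "A"]
example_accuracy = accuracy(example_logits, example_answers)
print(example_accuracy)
```

In a real run, each dictionary would be filled by reading the model's next-token logits at the positions of the option tokens after the few-shot prompt.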

For more details on the metrics, see Metrics.

Property	Details
Model type	Colossal-LLaMA-2-7B-base
Training data	About 8.5 billion tokens

| Model | Backbone | Tokens consumed | MMLU (5-shot) | CMMLU (5-shot) | AGIEval (5-shot) | GAOKAO (0-shot) | CEval (5-shot) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Baichuan-7B | - | 1.2T | 42.32 (42.30) | 44.53 (44.02) | 38.72 | 36.74 | 42.80 |
| Baichuan2-7B-Base | - | 2.6T | 46.97 (54.16) | 57.67 (57.07) | 45.76 | 52.60 | 54.00 |
| ChatGLM-6B | - | 1.0T | 39.67 (40.63) | 41.17 (-) | 40.10 | 36.53 | 38.90 |
| ChatGLM2-6B | - | 1.4T | 44.74 (45.46) | 49.40 (-) | 46.36 | 45.49 | 51.70 |
| InternLM-7B | - | - | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
| Qwen-7B (original) | - | 2.2T | 54.29 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
| Qwen-7B | - | 2.4T | 58.33 (58.20) | 62.54 (62.20) | 64.34 | 74.05 | 63.50 |
| Llama-2-7B | - | 2.0T | 44.47 (45.30) | 32.97 (-) | 32.60 | 25.46 | - |
| Linly-AI/Chinese-LLaMA-2-7B-hf | Llama-2-7B | 1.0T | 37.43 | 29.92 | 32.00 | 27.57 | - |
| wenge-research/yayi-7b-llama2 | Llama-2-7B | - | 38.56 | 31.52 | 30.99 | 25.95 | - |
| ziqingyang/chinese-llama-2-7b | Llama-2-7B | - | 33.86 | 34.69 | 34.52 | 25.18 | 34.2 |
| TigerResearch/tigerbot-7b-base | Llama-2-7B | 0.3T | 43.73 | 42.04 | 37.64 | 30.61 | - |
| LinkSoul/Chinese-Llama-2-7b | Llama-2-7B | - | 48.41 | 38.31 | 38.45 | 27.72 | - |
| FlagAlpha/Atom-7B | Llama-2-7B | 0.1T | 49.96 | 41.10 | 39.83 | 33.00 | - |
| Colossal-LLaMA-2-7b-base | Llama-2-7B | 0.0085T | 53.06 | 49.89 | 51.48 | 58.82 | 50.20 |

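Scores in parentheses are those reported in the models' official repositories.

ChatGLM models are evaluated zero-shot.

When evaluating Qwen-7B on the MMLU dataset, the prompt is "xxx Answer:" (with the space after ":" removed), and we compute the logits over " A", " B", " C", and " D". Both the original and the updated versions of Qwen-7B are more deterministic than the other models; for example, the logit for " A" can be -inf, which gives a softmax value of 0.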

For other models and other datasets, we compute the logits over "A", "B", "C", and "D" (without the leading space).
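To see why a logit of -inf yields a softmax value of exactly 0, consider the numerically stable softmax sketched below. The logit values are invented for illustration, and the function assumes at least one logit is finite.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; assumes at least one finite logit."""
    peak = max(logits)
    # exp(-inf) evaluates to 0.0, so a -inf logit contributes nothing.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over the four option tokens, with the first set to -inf.
probs = softmax([float("-inf"), 2.0, 0.0, 0.0])
print(probs)
```

The first probability is exactly zero and the remaining three share all of the probability mass, which is the "more deterministic" behavior described for Qwen-7B.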

❗️ For more details on the evaluation methods and on reproducing the results, see ColossalEval.

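Technical Details

Data

To improve LLaMA-2's ability to understand and generate Chinese content, the Colossal-AI team proposes continuing the pre-training of the LLaMA-2 model on both Chinese and English corpora.

Large language models such as LLaMA-2 are trained on a mix of high-quality datasets and achieve promising results. Improving LLaMA-2's performance on Chinese while preserving its English ability depends on two factors: the composition of the dataset (covering both Chinese and English content) and the quality of each sub-dataset.

[Figure: data processing pipeline of Colossal-LLaMA-2]

❗️ Important: we will open-source our data processing toolkit soon. Stay tuned!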

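Tokenizer

First, the original LLaMA-2 vocabulary contains fewer than 1,000 Chinese characters, so it cannot encode full Chinese text efficiently. Second, its reliance on byte tokens makes it harder for Transformer encoders to capture the semantic nuances of Chinese characters.

To address these issues, we extend the LLaMA-2 vocabulary from 32,000 to 69,104 entries. To adapt the LLaMA-2 model to the Colossal-LLaMA-2 tokenizer, we initialize the new word embeddings with mean values computed from the original LLaMA-2 embeddings and append these new rows to the end of the original embedding matrix.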
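The mean-based initialization can be sketched on a toy matrix. The code below takes one possible reading of the sentence above, in which every new row is the column-wise mean of all original rows; the project's actual procedure may compute the mean differently. The helper name `extend_embeddings` and the toy values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def extend_embeddings(original: Matrix, new_vocab_size: int) -> Matrix:
    """Append rows equal to the column-wise mean of `original`.

    Assumption: each new token starts from the mean of all original rows.
    """
    old_vocab_size = len(original)
    if new_vocab_size < old_vocab_size:
        raise ValueError("new vocabulary must not be smaller than the original")
    dim = len(original[0])
    mean_row = [sum(row[j] for row in original) / old_vocab_size
                for j in range(dim)]
    new_rows = [list(mean_row) for _ in range(new_vocab_size - old_vocab_size)]
    # Original rows keep their positions; new rows go at the end.
    return original + new_rows


# Toy example: 3 original tokens with 2-dimensional embeddings, extended to 5.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
extended = extend_embeddings(toy, new_vocab_size=5)
print(extended)
```

Because the original rows keep their indices, token IDs from the original tokenizer still map to the same embeddings after the extension.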

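Advantages of extending the vocabulary:

Improves the compression rate of string-sequence encoding.
Enhances the integrity of information.
Lets encoded sequences carry more valuable information, which in theory improves chapter-level encoding.

Drawbacks of a large vocabulary in a low-resource setting:

With a limited training dataset, many tokens go unused, and an excessive number of tokens may not be learned effectively.
Over-extending the vocabulary increases the number of embedding-related parameters, which raises memory usage and in turn hurts training efficiency.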
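The parameter overhead of a larger vocabulary is easy to estimate. The sketch below uses the vocabulary sizes from the text and LLaMA-2-7B's hidden size of 4096; it assumes the input embedding and the output projection are separate matrices and that weights are stored in 16 bits, both of which are assumptions rather than details stated above.

```python
OLD_VOCAB = 32_000       # original LLaMA-2 vocabulary size (from the text)
NEW_VOCAB = 69_104       # extended vocabulary size (from the text)
HIDDEN_SIZE = 4096       # LLaMA-2-7B hidden dimension
BYTES_PER_PARAM = 2      # assumption: 16-bit weights

added_rows = NEW_VOCAB - OLD_VOCAB
# Assumption: untied input embedding and output projection,
# so every new token adds one row to each of the two matrices.
added_params = 2 * added_rows * HIDDEN_SIZE
added_mib = added_params * BYTES_PER_PARAM / 2**20

print(f"New rows: {added_rows}")
print(f"Extra parameters: {added_params:,}")
print(f"Extra weight memory: {added_mib:.2f} MiB")
```

Under these assumptions the extension adds roughly 0.3 billion parameters before counting gradients and optimizer state, which is why the vocabulary size has to be traded off against training efficiency.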

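To balance the two, we settled on a vocabulary of 69,104 tokens. The table below compares various 7B-scale models:

Model	Vocabulary size	Compression rate	Average sample length (tokens)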
Colossal-LLaMA-2	69104	0.659	73.682
LLaMA-2-7B	32000	1.205	134.689
Atom-7B	65000	0.634	70.915
Baichuan-7B	64000	0.678	75.857
Baichuan2-7B-base	125696	0.570	63.761
Chatglm2-6B	64789	0.645	72.178
InternLM-7B	103168	0.566	63.349
Qwen-7B	151643	0.578	64.703
Tigerbot-7B-base	60515	0.630	70.515
Yayi-7B-llama2	32005	1.214	135.689
Chinese-llama-2-7b	55296	0.668	74.690
Chinese-Falcon-7B	90046	0.669	74.858
LinkSoul-Chinese-Llama-2-7b	40076	0.958	107.089
Ziya-LLaMA-13B-v1.1	39410	0.958	107.074
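The table can be read as relative token savings. The sketch below copies the LLaMA-2-7B and Colossal-LLaMA-2 rows and compares them; it does not assume how the compression rate is defined, and only checks that the two columns imply about the same ratio.

```python
# (compression rate, average tokens per sample) copied from the table.
rows = {
    "LLaMA-2-7B": (1.205, 134.689),
    "Colossal-LLaMA-2": (0.659, 73.682),
}

base_rate, base_len = rows["LLaMA-2-7B"]
new_rate, new_len = rows["Colossal-LLaMA-2"]

# Ratio of token counts for the same samples under the two tokenizers.
length_ratio = new_len / base_len
# Ratio of the reported compression rates; should be close to length_ratio.
rate_ratio = new_rate / base_rate
token_saving = 1 - length_ratio

print(f"Length ratio: {length_ratio:.3f}")
print(f"Compression-rate ratio: {rate_ratio:.3f}")
print(f"Fewer tokens than LLaMA-2: {token_saving:.1%}")
```

On the evaluation samples, the extended tokenizer needs a little over half as many tokens as the original LLaMA-2 tokenizer, so the same 4096-token window covers correspondingly more text of that kind.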

Training Logs

Training logs from our experiment:

[Figure: training loss over steps]
[Figure: training loss over tokens]

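Training Strategy

Multi-stage training

To improve model performance and make full use of the original LLaMA-2, we developed a multi-stage training strategy that unlocks the model's capabilities systematically, stage by stage.

The training process is therefore divided into three stages:

Large-scale pre-training stage (done by LLaMA-2): this initial stage builds the model's foundational capabilities from scratch and requires a large dataset of no fewer than 1 trillion tokens.
Chinese knowledge injection stage: in this stage we introduce Chinese knowledge into the model, which requires access to a high-quality dataset rich in comprehensive knowledge of the Chinese language.
Knowledge replay stage: knowledge is replayed through a question-answering (QA) mechanism, covering both the Chinese and English domains.

After multi-stage training, the model shows notable improvements on both English and Chinese benchmarks.

[Figure: the three training stages of Colossal-LLaMA-2]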

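Bucket-based training

Our experiments show that the distribution within the training dataset and the ordering of data points on different topics significantly affect the model's overall performance, particularly in continual pre-training of LLaMA-2.

To achieve a more balanced distribution and control the ordering of the dataset, we divide each sub-dataset into discrete bins. These bins are then combined into individual data buckets, with each sub-dataset contributing one bin to each bucket.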
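A minimal sketch of the bin-and-bucket idea follows. It assumes contiguous bins of near-equal size and simple concatenation within a bucket; the helper names `split_into_bins` and `build_buckets` and the sub-dataset names are hypothetical, and the project's real bucketing may differ.

```python
from typing import Dict, List, Sequence


def split_into_bins(items: Sequence, num_bins: int) -> List[list]:
    """Split `items` into `num_bins` contiguous bins of near-equal size."""
    base, extra = divmod(len(items), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        # The first `extra` bins take one additional item each.
        end = start + base + (1 if i < extra else 0)
        bins.append(list(items[start:end]))
        start = end
    return bins


def build_buckets(sub_datasets: Dict[str, Sequence], num_buckets: int) -> List[list]:
    """Bucket i is the concatenation of bin i from every sub-dataset."""
    binned = {name: split_into_bins(data, num_buckets)
              for name, data in sub_datasets.items()}
    return [
        [sample for name in sub_datasets for sample in binned[name][i]]
        for i in range(num_buckets)
    ]


# Toy corpus with three hypothetical sub-datasets of different sizes.
toy_corpus = {
    "zh_web": ["zh1", "zh2", "zh3", "zh4"],
    "en_web": ["en1", "en2"],
    "code": ["c1", "c2", "c3"],
}
buckets = build_buckets(toy_corpus, num_buckets=2)
print(buckets)
```

Each bucket then holds the same proportion of every sub-dataset as the full corpus, so training on buckets in sequence keeps the data mix stable over time.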

For more details, please refer to our GitHub.

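Limitations

Colossal-LLaMA-2-7B is a derivative of LLaMA-2 and carries risks in use. Testing so far has been conducted only in Chinese and English, and it cannot cover all possible scenarios. As with other large language models, the potential outputs of Colossal-LLaMA-2-7B-base cannot be predicted in advance. In some cases, Colossal-LLaMA-2-7B-base may produce inaccurate, biased, or harmful responses. Developers must therefore perform safety testing and tuning before deploying any application powered by Colossal-LLaMA-2-7B-base, so that the model meets the specific requirements of that application.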

Citation

@article{bian2021colossal,
    title={Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},
    author={Bian, Zhengda and Liu, Hongxin and Wang, Boxiang and Huang, Haichen and Li, Yongbin and Wang, Chuanrui and Cui, Fan and You, Yang},
    journal={arXiv preprint arXiv:2110.14883},
    year={2021}
}

@misc{touvron2023llama,
    title={Llama 2: Open Foundation and Fine-Tuned Chat Models}, 
    author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
    year={2023},
    eprint={2307.09288},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@article{dao2023flashattention2,
    title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
    author={Dao, Tri},
    year={2023}
}