Library name: transformers
License: apache-2.0
Language:
- English
Datasets:
- HuggingFaceFW/fineweb-edu
HuggingFaceFW/ablation-model-fineweb-edu model card
Model summary
This model is part of the FineWeb ablation experiments; see the technical report for details.
It has 1.82B parameters, a context length of 2048, and uses a Llama architecture with RoPE. It was trained on 350B tokens from FineWeb-Edu, tokenized with the gpt2 tokenizer (see the configuration sketch after the list below).
- Paper: 🍷 FineWeb: decanting the web for the finest text data at scale, https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- License: Apache-2.0
- Language: English
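The figures above can be checked directly from the published configuration. The following is a minimal sketch (not part of the original card) that loads the config and counts parameters:

# Minimal sketch: inspect the published config and parameter count.
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
config = AutoConfig.from_pretrained(checkpoint)
print(config.model_type)               # expected: "llama"
print(config.max_position_embeddings)  # context length, expected: 2048

model = AutoModelForCausalLM.from_pretrained(checkpoint)
print(f"{model.num_parameters() / 1e9:.2f}B parameters")  # roughly 1.82B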
Use
Intended use
This model was trained on English web data and is not instruction-tuned; it is intended for text completion in English.
Note that its primary purpose is to be compared against other models trained under the same conditions; it does not represent the best result achievable with this dataset.
Generation
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # use "cpu" if no GPU is available

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Encode a prompt, generate a completion, and decode it back to text.
inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
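By default, generate produces a short greedy continuation; for longer or more varied completions you can pass explicit generation parameters. The values below are illustrative and not taken from the card:

# Illustrative generation settings (not prescribed by the model card).
outputs = model.generate(
    inputs,
    max_new_tokens=100,  # length of the completion
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))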
Intermediate checkpoints (coming soon)
We will release intermediate checkpoints for every 1000 training steps in separate branches, named following the convention step-001000-2BT.
You can load a specific checkpoint by passing the revision argument to transformers:
model = AutoModelForCausalLM.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu", revision="step-001000-2BT")
You can list all the available revisions with the following code:
from huggingface_hub import list_repo_refs
out = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
print([b.name for b in out.branches])
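For example, to keep only the intermediate-checkpoint branches, you can filter the output (a sketch assuming the step-XXXXXX-XBT naming scheme described above):

# Sketch: list only branches that follow the intermediate-checkpoint naming scheme.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
checkpoint_branches = sorted(b.name for b in refs.branches if b.name.startswith("step-"))
for branch in checkpoint_branches:
    print(branch)  # e.g. step-001000-2BT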
Training details
Model configuration
- Architecture: Llama model
- Pretraining steps: 167k
- Pretraining tokens: 350B
- Precision: bfloat16 (see the loading sketch below)
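Since training used bfloat16, you may also want to load the weights in that precision; a minimal sketch (torch_dtype is a standard from_pretrained argument, but this snippet is not part of the original card):

# Sketch: load the weights in bfloat16, matching the training precision listed above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceFW/ablation-model-fineweb-edu",
    torch_dtype=torch.bfloat16,
)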
Hardware
- GPUs: 64 H100s
- Training time: 72 wall-clock hours
Software
Evaluation
We evaluated all ablation models with the same setup using lighteval. To reproduce our results, make sure to follow the instructions here:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
--custom_tasks "lighteval_tasks.py" --output_dir [output path] --max_samples 1000 \
--tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"
In particular, note that the MMLU prompts differ slightly from those used in lm-evaluation-harness and the Open LLM Leaderboard (see the blog post for details). We use prompt templates that provide a better signal for small and non-instruction-tuned models.
Limitations
The model was primarily trained on English data and may underperform in other languages. Its behavior is also shaped by the quality and diversity of its training data, which may include biased and harmful content.