Library name: transformers
License: apache-2.0
Language:
- English
Datasets:
- HuggingFaceFW/fineweb-edu
HuggingFaceFW/ablation-model-fineweb-edu model card
Model summary
This model is part of the FineWeb ablation experiments; see the technical report for details.
It has 1.82B parameters, a context length of 2048, and uses a Llama architecture with RoPE. It was trained on 350B tokens from FineWeb-Edu, tokenized with the gpt2 tokenizer (see the configuration sketch after the list below).
- Paper: 🍷 FineWeb: decanting the web for the finest text data at scale, https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- License: Apache-2.0
- Language: English
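The figures above can be checked directly from the published configuration. The following is a minimal sketch (not part of the original card) that loads the config and counts parameters:

# Minimal sketch: inspect the published config and parameter count.
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
config = AutoConfig.from_pretrained(checkpoint)
print(config.model_type)               # expected: "llama"
print(config.max_position_embeddings)  # context length, expected: 2048

model = AutoModelForCausalLM.from_pretrained(checkpoint)
print(f"{model.num_parameters() / 1e9:.2f}B parameters")  # roughly 1.82B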
Use
Intended use
This model was trained on English web data and is not instruction-tuned; it is intended for text completion in English.
Note that its primary purpose is to be compared against other models trained under the same conditions; it does not represent the best result achievable with this dataset.
Generation
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # use "cpu" if no GPU is available

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Encode a prompt, generate a completion, and decode it back to text.
inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
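By default, generate produces a short greedy continuation; for longer or more varied completions you can pass explicit generation parameters. The values below are illustrative and not taken from the card:

# Illustrative generation settings (not prescribed by the model card).
outputs = model.generate(
    inputs,
    max_new_tokens=100,  # length of the completion
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))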
Intermediate checkpoints (coming soon)
We will release intermediate checkpoints for every 1000 training steps in separate branches, named following the convention step-001000-2BT.
You can load a specific checkpoint by passing the revision argument to transformers:
model = AutoModelForCausalLM.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu", revision="step-001000-2BT")
You can list all the available revisions with the following code:
from huggingface_hub import list_repo_refs
out = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
print([b.name for b in out.branches])
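For example, to keep only the intermediate-checkpoint branches, you can filter the output (a sketch assuming the step-XXXXXX-XBT naming scheme described above):

# Sketch: list only branches that follow the intermediate-checkpoint naming scheme.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
checkpoint_branches = sorted(b.name for b in refs.branches if b.name.startswith("step-"))
for branch in checkpoint_branches:
    print(branch)  # e.g. step-001000-2BT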
Training details
Model configuration
- Architecture: Llama model
- Pretraining steps: 167k
- Pretraining tokens: 350B
- Precision: bfloat16 (see the loading sketch below)
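Since training used bfloat16, you may also want to load the weights in that precision; a minimal sketch (torch_dtype is a standard from_pretrained argument, but this snippet is not part of the original card):

# Sketch: load the weights in bfloat16, matching the training precision listed above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceFW/ablation-model-fineweb-edu",
    torch_dtype=torch.bfloat16,
)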
Hardware
- GPUs: 64 H100s
- Training time: 72 wall-clock hours
Software
Evaluation
We evaluated all ablation models with the same setup using lighteval. To reproduce our results, make sure to follow the instructions here:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
--custom_tasks "lighteval_tasks.py" --output_dir [output path] --max_samples 1000 \
--tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"
In particular, note that the MMLU prompts differ slightly from those used in lm-evaluation-harness and the Open LLM Leaderboard (see the blog post for details). We use prompt templates that provide a better signal for small and non-instruction-tuned models.
Limitations
The model was primarily trained on English data and may underperform in other languages. Its behavior is also shaped by the quality and diversity of its training data, which may include biased and harmful content.