---
pipeline_tag: text-generation
inference:
  parameters:
    temperature: 0.2
    top_p: 0.95
widget:
- text: 'def print_hello_world():'
  example_title: Hello world
  group: Python
datasets:
- bigcode/the-stack-v2-train
license: bigcode-openrail-m
library_name: transformers
tags:
- code
model-index:
- name: starcoder2-15b
  results:
  - task:
      type: text-generation
    dataset:
      name: CruxEval-I
      type: cruxeval-i
    metrics:
  - task:
      type: text-generation
    dataset:
      name: DS-1000
      type: ds-1000
    metrics:
  - task:
      type: text-generation
    dataset:
      name: GSM8K (PAL)
      type: gsm8k-pal
    metrics:
  - task:
      type: text-generation
    dataset:
      name: HumanEval+
      type: humanevalplus
    metrics:
  - task:
      type: text-generation
    dataset:
      name: HumanEval
      type: humaneval
    metrics:
  - task:
      type: text-generation
    dataset:
      name: RepoBench-v1.1
      type: repobench-v1.1
    metrics:
---
# StarCoder2

## Table of Contents

- Model Summary
- Use
- Limitations
- Training
- License
- Citation
## Model Summary

StarCoder2-15B is a 15B-parameter model trained on 600+ programming languages from The Stack v2, with opt-out requests excluded. The model uses Grouped Query Attention, a context window of 16,384 tokens with sliding window attention of 4,096 tokens, and was trained on 4+ trillion tokens using the Fill-in-the-Middle objective.

The model was trained with the NVIDIA NeMo™ Framework on the NVIDIA Eos Supercomputer, built with NVIDIA DGX H100 systems.
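These architectural details can also be inspected programmatically from the checkpoint's configuration. The sketch below assumes the standard Starcoder2 configuration attribute names in `transformers` (`max_position_embeddings`, `sliding_window`, `num_key_value_heads`); verify them against your installed version.

```python
# Minimal sketch: read the architecture details reported above from the model config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigcode/starcoder2-15b")
print(config.max_position_embeddings)   # context window (16,384 tokens)
print(config.sliding_window)            # sliding window attention (4,096 tokens)
print(config.num_attention_heads,       # grouped-query attention: fewer key/value
      config.num_key_value_heads)       # heads than attention heads
```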
## Use

### Intended use

The model was trained on GitHub code as well as additional selected data sources such as Arxiv and Wikipedia. As such, it is not an instruction model, and commands like "Write a function that computes the square root." do not work well.
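In practice this means prompts should look like code to be completed rather than natural-language requests. The snippet below only illustrates that distinction; both prompts are hypothetical examples, not taken from the model card.

```python
# Instruction-style prompt: tends to work poorly with a base code model.
instruction_prompt = "Write a function that computes the square root."

# Completion-style prompt: a signature and docstring the model can continue.
completion_prompt = (
    "def square_root(x: float) -> float:\n"
    '    """Return the square root of x."""\n'
)
```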
### Generation

Here are some examples to get started with the model. You can find a fine-tuning script in StarCoder2's GitHub repository.

First, make sure to install `transformers` from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```
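As an alternative to the snippets below, the high-level `pipeline` API also works for quick experiments. This is a sketch rather than part of the original card; `device_map="auto"` and `max_new_tokens=32` are arbitrary choices.

```python
# Quick-start sketch using the transformers pipeline API.
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder2-15b", device_map="auto")
print(pipe("def print_hello_world():", max_new_tokens=32)[0]["generated_text"])
```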
#### Running the model on CPU/GPU/multi GPU

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-15b"
device = "cuda"  # "cuda" for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load the model in full precision and move it to the target device.
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
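The card metadata above lists temperature 0.2 and top_p 0.95 as inference parameters; continuing from the snippet above, they can be passed straight to `generate`. The `max_new_tokens` value here is an arbitrary illustrative choice.

```python
# Sampling sketch: reuses `model`, `tokenizer`, and `inputs` from the previous snippet.
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.2,    # from the card's inference parameters
    top_p=0.95,         # from the card's inference parameters
    max_new_tokens=64,  # illustrative value
)
print(tokenizer.decode(outputs[0]))
```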
To load the model in half precision with `torch.bfloat16` and automatic device placement:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder2-15b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load the weights in bfloat16 and let device_map="auto" spread them across available devices.
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

```python
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 32251.33 MB
```
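StarCoder2 was trained with a Fill-in-the-Middle objective, so infilling prompts can also be used. The sketch below continues from the snippet above and assumes StarCoder-style FIM sentinel tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); check the tokenizer's special tokens before relying on these exact names.

```python
# Fill-in-the-middle sketch: predict the code between the prefix and the suffix.
# Assumes StarCoder-style FIM sentinels; verify with tokenizer.additional_special_tokens.
fim_prompt = (
    "<fim_prefix>def fibonacci(n):\n    <fim_suffix>\n"
    "    return fibonacci(n - 1) + fibonacci(n - 2)<fim_middle>"
)
inputs = tokenizer.encode(fim_prompt, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```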
#### Quantized versions through `bitsandbytes`

Using 8-bit precision (int8):

```python
# Requires: pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization config; a 4-bit variant is sketched below.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

checkpoint = "bigcode/starcoder2-15b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

Memory footprint with `load_in_8bit=True`:

```python
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 16900.18 MB
```

Memory footprint with `load_in_4bit=True`:

```python
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 9224.60 MB
```
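A sketch of the corresponding 4-bit loading, swapping `load_in_8bit=True` for `load_in_4bit=True` (the rest of the example is unchanged):

```python
# 4-bit loading sketch via bitsandbytes.
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

checkpoint = "bigcode/starcoder2-15b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)
```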
### Attribution & other requirements

The pretraining dataset for the model was filtered to include only permissively licensed code and code with no license. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements that must be respected. We provide a search index that lets you search the pretraining data to identify where generated code came from and apply the proper attribution to your code.
## Limitations

The model has been trained on source code in 600+ programming languages. The predominant natural language in the source code is English, although other languages are also present. As such, the model can generate code snippets given some context, but the generated code is not guaranteed to work as intended. It can be inefficient and contain bugs or vulnerabilities. See the paper for an in-depth discussion of the model's limitations.
## Training

### Model

- Architecture: Transformer decoder with grouped-query and sliding-window attention and a Fill-in-the-Middle objective
- Pretraining steps: 1 million
- Pretraining tokens: 4+ trillion
- Precision: bfloat16
### Hardware

- Compute: NVIDIA Eos Supercomputer, built with NVIDIA DGX H100 systems

### Software

- Framework: NVIDIA NeMo™ Framework
## License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement here.
## Citation
```bibtex
@misc{lozhkov2024starcoder,
      title={StarCoder 2 and The Stack v2: The Next Generation},
      author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and Federico Cassano and Joel Lamy-Poirier and Nouamane Tazi and Ao Tang and Dmytro Pykhtar and Jiawei Liu and Yuxiang Wei and Tianyang Liu and Max Tian and Denis Kocetkov and Arthur Zucker and Younes Belkada and Zijian Wang and Qian Liu and Dmitry Abulkhanov and Indraneil Paul and Zhuang Li and Wen-Ding Li and Megan Risdal and Jia Li and Jian Zhu and Terry Yue Zhuo and Evgenii Zheltonozhskii and Nii Osae Osae Dade and Wenhao Yu and Lucas Krauß and Naman Jain and Yixuan Su and Xuanli He and Manan Dey and Edoardo Abati and Yekun Chai and Niklas Muennighoff and Xiangru Tang and Muhtasham Oblokulov and Christopher Akiki and Marc Marone and Chenghao Mou and Mayank Mishra and Alex Gu and Binyuan Hui and Tri Dao and Armel Zebaze and Olivier Dehaene and Nicolas Patry and Canwen Xu and Julian McAuley and Han Hu and Torsten Scholak and Sebastien Paquet and Jennifer Robinson and Carolyn Jane Anderson and Nicolas Chapados and Mostofa Patwary and Nima Tajbakhsh and Yacine Jernite and Carlos Muñoz Ferrandis and Lingming Zhang and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
      year={2024},
      eprint={2402.19173},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
```