---
pipeline_tag: text-generation
inference: true
widget:
- text: 'def print_hello_world():'
  example_title: Hello world
  group: Python
datasets:
- bigcode/the-stack-v2-train
license: bigcode-openrail-m
library_name: transformers
tags:
- code
model-index:
- name: starcoder2-3b
  results:
  - task:
      type: text-generation
    dataset:
      name: CruxEval-I
      type: cruxeval-i
    metrics:
  - task:
      type: text-generation
    dataset:
      name: DS-1000
      type: ds-1000
    metrics:
  - task:
      type: text-generation
    dataset:
      name: GSM8K (PAL)
      type: gsm8k-pal
    metrics:
  - task:
      type: text-generation
    dataset:
      name: HumanEval+
      type: humanevalplus
    metrics:
  - task:
      type: text-generation
    dataset:
      name: HumanEval
      type: humaneval
    metrics:
  - task:
      type: text-generation
    dataset:
      name: RepoBench-v1.1
      type: repobench-v1.1
    metrics:
---
# StarCoder2

## Table of Contents

- Model Summary
- Use
- Limitations
- Training
- License
- Citation
## Model Summary

StarCoder2-3B is a 3B parameter model trained on 17 programming languages from The Stack v2, with opt-out requests excluded. The model uses Grouped Query Attention, a context window of 16,384 tokens with sliding window attention of 4,096 tokens, and was trained using the Fill-in-the-Middle objective on 3+ trillion tokens.
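As a quick local check of these architecture details, you can inspect the model configuration without downloading the weights. This is a minimal sketch; the attribute names `sliding_window` and `num_key_value_heads` are assumptions based on common `transformers` conventions and may differ across library versions.

```python
from transformers import AutoConfig

# Loads only config.json; no model weights are downloaded.
config = AutoConfig.from_pretrained("bigcode/starcoder2-3b")

# Attribute names below are assumptions; `getattr` returns None if they differ.
print(config.model_type)                              # e.g. "starcoder2"
print(getattr(config, "sliding_window", None))        # sliding window attention size
print(getattr(config, "num_key_value_heads", None))   # grouped-query attention KV heads
```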
## Use

### Intended use

The model was trained on GitHub code, as well as additional selected data sources such as Arxiv and Wikipedia. As such it is _not_ an instruction model and commands like "Write a function that computes the square root." do not work well.
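In practice, prompts are plain code to be completed, or, for infilling, code split around Fill-in-the-Middle sentinel tokens. The sketch below is an assumption: the token names `<fim_prefix>`, `<fim_suffix>` and `<fim_middle>` are carried over from the StarCoder family, so check the tokenizer's special tokens to confirm they exist in your version.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Fill-in-the-Middle style prompt: the model is asked to produce the code
# that belongs between the prefix and the suffix.
# The sentinel token names are assumptions from the StarCoder family.
prompt = (
    "<fim_prefix>def fibonacci(n):\n    "
    "<fim_suffix>\n    return fib\n<fim_middle>"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```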
### Generation

Here are some examples to get started with the model. You can find a fine-tuning script in StarCoder2's GitHub repository.

First, make sure to install `transformers` from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```
#### Running the model on CPU/GPU/multi GPU

* _Using full precision_

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
device = "cuda"  # use "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

```bash
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 12624.81 MB
```
* _Using `torch.bfloat16`_

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# device_map="auto" places the model on the available GPU(s); requires `accelerate`
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

```bash
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 6312.41 MB
```
#### Quantized Versions through `bitsandbytes`

* _Using 8-bit precision (int8)_

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization via bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

```bash
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 3434.07 MB
```
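The second, smaller footprint shown below corresponds to 4-bit loading. The code for that variant is not shown above; a minimal sketch, assuming it mirrors the 8-bit example with `load_in_4bit=True`, would look like this:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization; assumption: the rest of the pipeline mirrors the 8-bit example above.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```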
```bash
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 1994.90 MB
```
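However the model is loaded, `generate()` with default settings typically returns only a short continuation, since the library caps the generated length unless you override it. You can pass standard generation arguments to control length and sampling; `model`, `tokenizer` and `inputs` below refer to any of the examples above, and the parameter values are illustrative, not official recommendations.

```python
# Continue from the `model`, `tokenizer` and `inputs` defined in the examples above.
outputs = model.generate(
    inputs,
    max_new_tokens=128,   # generate up to 128 new tokens
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.2,      # low temperatures tend to work well for code
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```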
### Attribution & Other Requirements

The pretraining dataset of the model was filtered for permissively licensed and unlicensed code only. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements that must be respected. We provide a search index that lets you search through the pretraining data to identify where generated code came from, and apply the proper attribution to your code.
## Limitations

The model has been trained on source code from 600+ programming languages. The predominant natural language in the source code is English, although other languages are also present. As such, the model is capable of generating code snippets provided some context, but the generated code is not guaranteed to work as intended. It can be inefficient and contain bugs or vulnerabilities. See the paper for an in-depth discussion of the model limitations.
## Training

### Model

- Architecture: Transformer decoder with grouped-query and sliding window attention and Fill-in-the-Middle objective
- Pretraining steps: 1.2 million
- Pretraining tokens: 3+ trillion (see the quick calculation after this list)
- Precision: bfloat16
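A quick back-of-the-envelope calculation from the two totals above; this is not an official figure, just arithmetic on the reported numbers.

```python
# Rough average tokens processed per optimizer step, derived only from the
# reported totals: 3+ trillion tokens over 1.2 million pretraining steps.
pretraining_tokens = 3.0e12   # lower bound: "3+ trillion"
pretraining_steps = 1.2e6
print(f"{pretraining_tokens / pretraining_steps:,.0f} tokens/step")  # 2,500,000
```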
### Hardware

### Software
## License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement here.
## Citation

```bibtex
@misc{lozhkov2024starcoder,
title={StarCoder 2 and The Stack v2: The Next Generation},
author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and Federico Cassano and Joel Lamy-Poirier and Nouamane Tazi and Ao Tang and Dmytro Pykhtar and Jiawei Liu and Yuxiang Wei and Tianyang Liu and Max Tian and Denis Kocetkov and Arthur Zucker and Younes Belkada and Zijian Wang and Qian Liu and Dmitry Abulkhanov and Indraneil Paul and Zhuang Li and Wen-Ding Li and Megan Risdal and Jia Li and Jian Zhu and Terry Yue Zhuo and Evgenii Zheltonozhskii and Nii Osae Osae Dade and Wenhao Yu and Lucas Krauß and Naman Jain and Yixuan Su and Xuanli He and Manan Dey and Edoardo Abati and Yekun Chai and Niklas Muennighoff and Xiangru Tang and Muhtasham Oblokulov and Christopher Akiki and Marc Marone and Chenghao Mou and Mayank Mishra and Alex Gu and Binyuan Hui and Tri Dao and Armel Zebaze and Olivier Dehaene and Nicolas Patry and Canwen Xu and Julian McAuley and Han Hu and Torsten Scholak and Sebastien Paquet and Jennifer Robinson and Carolyn Jane Anderson and Nicolas Chapados and Mostofa Patwary and Nima Tajbakhsh and Yacine Jernite and Carlos Muñoz Ferrandis and Lingming Zhang and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
year={2024},
eprint={2402.19173},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
```