starcoder-gpteacher-code-instruct: an open-source model for code generation and explanation, free to deploy

Starcoder Gpteacher Code Instruct

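Developed by GeorgiaTechResearchInstitute

Based on the StarCoder model and fine-tuned on the GPTeacher code generation dataset to improve code generation and explanation

Large language model

Transformers

License: OpenRAIL #code-instruction-fine-tuning #multilingual-code-generation #8192-token-context

Downloads: 122

Released: 5/5/2023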

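Model overview

This model is based on the 15.5-billion-parameter StarCoder model and fine-tuned on code instruction data generated by GPT-4. It focuses on code generation and explanation and supports more than 80 programming languages.

Model features

Large context window

Supports a context window of 8192 tokens, suitable for long code snippets

Multilingual support

The training data covers more than 80 programming languages, so the model adapts to a wide range of languages

Instruction fine-tuning

Fine-tuned on code instruction data generated by GPT-4, so it responds better to user instructions

Model capabilities

Code generation

Code explanation

Answering programming questions

Code completion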

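Use cases

Code development assistance

Function generation

Generates a function that implements a specific feature from a natural-language description

Can produce function implementations that match the stated requirements

Code explanation

Explains the logic and purpose of complex code snippets

Aims to give clear, accurate explanations of the code

Programming education

Programming learning aid

Helps learners understand programming concepts and code implementations

Provides easy-to-follow explanations and examples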

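🚀 StarCoder GPTeacher-Codegen Fine-Tuned Model

This is a Transformer-based text generation model. It was fine-tuned from a pretrained model so that it generates code in response to a given instruction.

🚀 Quick start

This model was obtained by fine-tuning bigcode/starcoder on the teknium1/GPTeacher code generation dataset (GPT-4 code instruction fine-tuning).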

✨ Key features

Multilingual support: the base StarCoder model has 15.5 billion parameters and was trained on more than 80 programming languages from The Stack (v1.2), with data covered by opt-out requests excluded.
Techniques used: the model uses Multi Query Attention and a context window of 8192 tokens, and was trained on 1 trillion tokens with the Fill-in-the-Middle objective.
Resources:
- Repository: [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- Project website: [bigcode-project.org](https://www.bigcode-project.org)
- Paper: [💫StarCoder: May the source be with you!](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)
- Contact: [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
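Because the prompt and the generated tokens share the same 8192-token window, it can help to budget generation length before calling the model. The following is a minimal sketch of that arithmetic; the function name `max_new_tokens_for` is hypothetical, and the prompt length is assumed to have been measured with the model's own tokenizer.

```python
CONTEXT_WINDOW = 8192  # context size stated for StarCoder in this card


def max_new_tokens_for(prompt_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Return how many tokens can still be generated after the prompt.

    Hypothetical helper: prompt_tokens is assumed to be the length of the
    tokenized prompt.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # Never return a negative budget when the prompt already fills the window.
    return max(context_window - prompt_tokens, 0)


print(max_new_tokens_for(1500))  # 6692
```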

💻 Usage examples

Basic usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct"
device = "cuda"  # requires a CUDA-capable GPU

# Alpaca-style prompt template with an instruction and an additional input.
input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
                "### Instruction:\n"
                "{instruction}\n\n"
                "### Input:\n"
                "{input}\n\n"
                "### Response:")

# The instruction, and the program it refers to (Duff's device in K&R-style C).
prompt = "Please explain the following program."
extra_input = "send(to, from, count) register short *to, *from; register count; { register n = (count + 7) / 8; switch (count % 8) { case 0: do { *to = *from++; case 7:      *to = *from++; case 6:      *to = *from++; case 5:      *to = *from++; case 4:      *to = *from++; case 3:      *to = *from++; case 2:      *to = *from++; case 1:      *to = *from++; } while (--n > 0); } }"
prompt = input_prompt.format_map({"instruction": prompt, "input": extra_input})

# Load the tokenizer and the model weights in half precision.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, torch_dtype=torch.float16).to(device)

# Tokenize the prompt, generate a continuation, and decode it back to text.
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
# max_new_tokens=256 is an assumed value, not from the original card; without a
# length setting, generation may stop at the library's short default length.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
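Loading the weights in float16 determines roughly how much GPU memory is needed before any generation happens. The sketch below only estimates the size of the weights from the 15.5 billion parameter count given for the base model; activations, the KV cache, and framework overhead are not included, so actual usage will be higher.

```python
PARAMS = 15.5e9       # parameter count stated for the base StarCoder model
BYTES_PER_PARAM = 2   # float16 stores each weight in 2 bytes


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the model weights in GiB (weights only)."""
    return num_params * bytes_per_param / 1024**3


print(f"{weight_memory_gib(PARAMS, BYTES_PER_PARAM):.1f} GiB")  # 28.9 GiB
```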

Example output

Full prompt:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
What does this program do and how does it work?
### Input:
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
        case 0: *to = *from++;
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++;
    }
    while (--n > 0) {
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
    }
}
### Response:

Response:

This program is designed to efficiently copy bytes from one memory location to another. It copies 'count' bytes from the location pointed to by 'from' to the location pointed to by 'to', and works correctly regardless of address alignment. The function 'foo' is a wrapper function that initializes the 'to' and 'from' pointers, checks for proper memory bounds, and performs byte alignment if necessary.

To use this program, you must declare two pointer variables, 'to' and 'from', and then call the 'foo' function with the following parameters:

foo(to, from, count)
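Decoding the full output sequence returns the prompt followed by the model's answer, so the answer usually has to be separated from the prompt. A minimal sketch of one way to do that is shown here; the helper `extract_response` is hypothetical, and the end-of-sequence string `<|endoftext|>` is an assumption about the tokenizer that should be checked against `tokenizer.eos_token`.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, eos_token: str = "<|endoftext|>") -> str:
    """Return the text after the first response marker.

    Hypothetical helper; eos_token is an assumed default, not taken from
    the model card.
    """
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    if not found:
        # No marker present: return the whole text rather than guessing.
        return decoded.strip()
    return tail.replace(eos_token, "").strip()


sample = "### Instruction:\nExplain.\n\n### Response:\nIt copies data.<|endoftext|>"
print(extract_response(sample))  # It copies data.
```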

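Attribution and other requirements

The pretraining dataset for this model was filtered to permissively licensed content only. Even so, the model can reproduce source code from the dataset verbatim. The license of that code may require attribution and/or impose other specific requirements, which must be respected. The BigCode project provides a search index over the pretraining data that can be used to find where generated code came from and to attribute it properly.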

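📚 Detailed documentation

Intended use

The base model was trained on GitHub code and then fine-tuned to follow instructions. A prompt such as "Write a function that computes the square root" should work reasonably well. The original repository recommends few-shot prompting with the [Tech Assistant prompt](https://huggingface.co/datasets/bigcode/ta-prompt) to make the model behave like a technical assistant. This fine-tuned model uses the [Alpaca prompts](https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py).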
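The Alpaca training script defines two templates, one for instructions that come with an extra input and one for instructions alone. The sketch below picks between them; the function name `build_prompt` is hypothetical, and whether this fine-tune saw the no-input variant during training is an assumption, since the card only shows the variant with an input.

```python
# Template for an instruction accompanied by additional input (as in the usage example).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)

# Template for an instruction on its own (assumed to apply to this model as well).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(instruction: str, extra_input: str = "") -> str:
    """Fill the matching Alpaca-style template (hypothetical helper)."""
    if extra_input.strip():
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=extra_input)
    return PROMPT_NO_INPUT.format(instruction=instruction)


print(build_prompt("Write a function that computes the square root."))
```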

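Limitations

The model was trained on source code in more than 80 programming languages. The natural language in that source code is mostly English, although other languages are present. The model can therefore generate code snippets given some context, but the generated code is not guaranteed to work as intended: it may be inefficient or contain bugs or vulnerabilities. For an in-depth discussion of the model's limitations, see the [original paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view). Fine-tuning makes the model more responsive to direct user input, but this is an early attempt at instruction fine-tuning StarCoder, and the results may not reflect the model's full potential.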
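Since generated code is not guaranteed to work, a cheap first filter is to check that it at least parses before running or reviewing it. The sketch below does this for Python output only, using the standard library; the helper `is_valid_python` is hypothetical, and passing this check says nothing about correctness, efficiency, or security.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the text parses as Python source (syntax check only)."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes.
        return False
    return True


print(is_valid_python("def square(x):\n    return x * x\n"))  # True
print(is_valid_python("def square(x) return x * x"))          # False
```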

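Training

Model parameters

Property	Details
Architecture	GPT-2 model with multi-query attention and Fill-in-the-Middle objective
Pretraining steps	250k
Pretraining tokens	1 trillion
Precision	bfloat16
Fine-tuning instruction-response pairs	4.5k
Fine-tuning context length	1024
Fine-tuning epochs	3
Fine-tuning learning rate	2e-5
Fine-tuning optimization	FSDP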

Hardware

Property	Details
GPUs	8 Tesla A100
Training time	5 hours
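The figures in the two tables give a rough sense of the fine-tuning budget. The sketch below is simple arithmetic on those figures: it treats "4.5k" as exactly 4500 pairs and assumes every example fills the full 1024-token context, so the token count is an upper bound rather than a measured value.

```python
PAIRS = 4500            # "4.5k" instruction-response pairs, taken as exactly 4500
CONTEXT_LENGTH = 1024   # fine-tuning context length in tokens
EPOCHS = 3
GPUS = 8
HOURS = 5

# Upper bound: assumes each example is padded or packed to the full context.
max_finetune_tokens = PAIRS * CONTEXT_LENGTH * EPOCHS
gpu_hours = GPUS * HOURS

print(f"{max_finetune_tokens:,} tokens at most")  # 13,824,000 tokens at most
print(f"{gpu_hours} GPU-hours")                   # 40 GPU-hours
```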

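📄 License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. The full agreement is available [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement). The model was also fine-tuned on outputs from OpenAI's GPT-4, so it is additionally subject to [OpenAI's terms of use](https://openai.com/policies/terms-of-use).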

Citation

The Hugging Face repository for the base model is bigcode/starcoder.

@article{li2023starcoder,
      title={StarCoder: may the source be with you!}, 
      author={Raymond Li and Loubna Ben Allal and Yangtian Zi and Niklas Muennighoff and Denis Kocetkov and Chenghao Mou and Marc Marone and Christopher Akiki and Jia Li and Jenny Chim and Qian Liu and Evgenii Zheltonozhskii and Terry Yue Zhuo and Thomas Wang and Olivier Dehaene and Mishig Davaadorj and Joel Lamy-Poirier and João Monteiro and Oleh Shliazhko and Nicolas Gontier and Nicholas Meade and Armel Zebaze and Ming-Ho Yee and Logesh Kumar Umapathi and Jian Zhu and Benjamin Lipkin and Muhtasham Oblokulov and Zhiruo Wang and Rudra Murthy and Jason Stillerman and Siva Sankalp Patel and Dmitry Abulkhanov and Marco Zocca and Manan Dey and Zhihan Zhang and Nour Fahmy and Urvashi Bhattacharyya and Wenhao Yu and Swayam Singh and Sasha Luccioni and Paulo Villegas and Maxim Kunakov and Fedor Zhdanov and Manuel Romero and Tony Lee and Nadav Timor and Jennifer Ding and Claire Schlesinger and Hailey Schoelkopf and Jan Ebert and Tri Dao and Mayank Mishra and Alex Gu and Jennifer Robinson and Carolyn Jane Anderson and Brendan Dolan-Gavitt and Danish Contractor and Siva Reddy and Daniel Fried and Dzmitry Bahdanau and Yacine Jernite and Carlos Muñoz Ferrandis and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
      year={2023},
      eprint={2305.06161},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Open LLM Leaderboard evaluation results

Detailed results are available [here](https://huggingface.co/datasets/open-llm-leaderboard/details_GeorgiaTechResearchInstitute__starcoder-gpteacher-code-instruct).

Metric	Value
Average	32.57
ARC (25-shot)	32.68
HellaSwag (10-shot)	47.6
MMLU (5-shot)	28.63
TruthfulQA (0-shot)	40.41
Winogrande (5-shot)	55.56
GSM8K (5-shot)	0.0
DROP (3-shot)	23.11
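The reported average appears to be the unweighted mean of the seven benchmark scores, which can be checked directly from the table. This is a sketch of that check, not the leaderboard's own computation.

```python
# Scores copied from the table above.
scores = {
    "ARC": 32.68,
    "HellaSwag": 47.6,
    "MMLU": 28.63,
    "TruthfulQA": 40.41,
    "Winogrande": 55.56,
    "GSM8K": 0.0,
    "DROP": 23.11,
}

# Unweighted mean across all seven benchmarks.
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 32.57
```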