Open-source replit-code-v1-3b code generation model: supports 20 programming languages and is free to use

Replit Code V1 3b

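Developed by replit

A 2.7-billion-parameter code generation model developed by Replit, supporting 20 programming languages

Large language model

Transformers

Other tags: #code-completion #multilingual-programming #low-resource-inference

Downloads: 605

Release date: 4/28/2023

Model overview

A causal language model focused on code completion, trained on the Stack Dedup dataset. It supports code generation and completion in multiple programming languages.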

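Model features

Multilingual support

Supports code generation and completion in 20 programming languages

Efficient training techniques

Uses Flash Attention to speed up training and inference, and ALiBi positional embeddings to support variable context lengths

Optimizer innovation

Uses the LionW optimizer to improve training efficiency

Large-scale training

Trained for 3 epochs on a 175-billion-token dataset, for a total of 525 billion tokens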

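Model capabilities

Code completion

Code generation

Multilingual support

Context-aware programming

Use cases

Development assistance

Code autocompletion

Provides intelligent code completion suggestions in the IDE

Improves development efficiency

Function generation

Generates a complete function implementation from a function signature or comment

pass@1 score of 0.219 (HumanEval)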
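For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator that is commonly used with HumanEval. The page does not say how the 0.219 figure was computed, so this is an illustration of the metric under that assumption, not a reproduction of Replit's evaluation. The function name `pass_at_k` is our own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: number of attempts the metric allows
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


# For k = 1 the estimate reduces to the fraction of passing samples.
print(pass_at_k(n=10, c=3, k=1))
```

The benchmark score is the mean of this per-problem estimate over all problems, so a pass@1 of 0.219 means that roughly 22% of single attempts pass the tests on average.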

Education

Programming learning aid

Provides code examples and explanations for learners

🚀 replit-code-v1-3b

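replit-code-v1-3b is a 2.7-billion-parameter causal language model focused on code completion. It was trained on a subset of the Stack Dedup v1.2 dataset and is intended to help developers write code more efficiently.

🧑‍💻 Test it in our demo space! 🧑‍💻

⚙️ Fine-tuning and instruction-tuning guide ⚙️

✨ Key features

Multilingual support: covers 20 languages, including Markdown, Java, JavaScript, and Python.
Large-scale training: trained for 3 epochs on a dataset of 175 billion tokens, for a total of 525 billion tokens.
Modern techniques: uses Flash Attention, ALiBi positional embeddings, the LionW optimizer, and other recent LLM techniques.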
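ALiBi replaces learned position embeddings with a per-head linear penalty on attention scores that grows with the distance between query and key, which is what lets the context length vary at inference time. The sketch below illustrates the idea only: it assumes a power-of-two head count (the simple case from the ALiBi paper) and is not the model's actual implementation. The helper names `alibi_slopes` and `alibi_bias_row` are our own.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes 2^(-8*i/n) for i = 1..n (power-of-two head counts)."""
    return [2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)]


def alibi_bias_row(slope: float, query_pos: int) -> list[float]:
    """Bias added to the attention scores of one query position.

    Covers the causal keys 0..query_pos. The bias is 0 for the query's own
    position and becomes more negative for keys further in the past.
    """
    return [-slope * (query_pos - key_pos) for key_pos in range(query_pos + 1)]


slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])
print(alibi_bias_row(slopes[0], query_pos=3))
```

Because the bias depends only on relative distance, the same formula applies to any sequence length without retraining position vectors.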

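📦 Installation

First, install the latest versions of the following dependencies:

einops
sentencepiece
torch
transformers

To use the optimized Triton implementation of FlashAttention on GPUs with BF16 support, also install:

flash-attn==0.2.8
triton==2.0.0.dev20221202

To load the model with 8-bit quantization, install these additional dependencies:

accelerate
bitsandbytes

To load the model with 4-bit quantization, install the following from the main branch of their respective repositories: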

pip install git+https://github.com/huggingface/accelerate.git
pip install git+https://github.com/huggingface/transformers.git
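Before loading the model, it can help to check which of the packages listed above are importable in the current environment. The snippet below is a minimal sketch using only the standard library. The grouping into `CORE` and `OPTIONAL` and the helper name `missing_packages` are our own, and the import names (for example `flash_attn` for the `flash-attn` package) are assumed to match the installed distributions.

```python
from importlib.util import find_spec

# Import names for the core dependencies listed above.
CORE = ["einops", "sentencepiece", "torch", "transformers"]

# Optional feature groups and the import names they need.
OPTIONAL = {
    "Triton FlashAttention": ["flash_attn", "triton"],
    "8-bit quantization": ["accelerate", "bitsandbytes"],
}


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


print("missing core packages:", missing_packages(CORE))
for feature, names in OPTIONAL.items():
    print(f"missing for {feature}:", missing_packages(names))
```

This only checks that the packages are present. It does not verify the pinned versions required for the Triton path.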

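💻 Usage examples

Basic usage

from transformers import AutoModelForCausalLM

# Load the model (trust_remote_code=True is required because the model class ships with the model repository)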
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

Advanced usage

Using the optimized Triton implementation of FlashAttention

import torch
from transformers import AutoModelForCausalLM, AutoConfig

config = AutoConfig.from_pretrained(
    "replit/replit-code-v1-3b",
    trust_remote_code=True
)
# Switch the attention implementation to the Triton kernel
config.attn_config['attn_impl'] = 'triton'

# Load the model and move it to the GPU in bfloat16
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', config=config, trust_remote_code=True)
model.to(device='cuda:0', dtype=torch.bfloat16)

# Forward pass on a dummy batch of token IDs
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x = x.to(device='cuda:0')
y = model(x)

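Using the tokenizer

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

# Encode a single input and generate
# (`model` is assumed to be loaded as in the examples above)
x = tokenizer.encode('def hello():\n  print("hello world")\n', return_tensors='pt')
x = x.to(model.device)  # keep the input on the same device as the model
y = model.generate(x)

# Decode with clean_up_tokenization_spaces=False so that whitespace, and therefore syntax, is preserved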
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)

Code generation

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# Decode with clean_up_tokenization_spaces=False so that whitespace, and therefore syntax, is preserved
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
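The sampling arguments in the call above (`temperature=0.2`, `top_k=4`, `top_p=0.95`) interact in a way that is easy to misread. The toy sketch below applies the three steps to a small list of logits in pure Python. It is a simplified illustration of the idea, not the transformers implementation. The function name `nucleus_filter` and the example logits are made up.

```python
import math


def nucleus_filter(logits, temperature=0.2, top_k=4, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # 1. Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # 2. Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the kept tokens (subtract the max for numerical stability).
    peak = max(scaled[i] for i in kept)
    exps = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # 3. Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}


toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]
print(nucleus_filter(toy_logits))                   # low temperature
print(nucleus_filter(toy_logits, temperature=1.0))  # neutral temperature
```

With a temperature as low as 0.2, almost all probability mass lands on the best token in this toy case, so sampling behaves nearly greedily. Raising the temperature lets more of the top-k candidates survive the top-p cut.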

Loading with 8-bit quantization

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b",
                                             trust_remote_code=True,
                                             device_map="auto",
                                             load_in_8bit=True)

Loading with 4-bit quantization

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b",
                                             trust_remote_code=True,
                                             device_map="auto",
                                             load_in_4bit=True)
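To get a feel for why quantized loading matters, the sketch below estimates the memory needed for the weights alone at different precisions. It is back-of-the-envelope arithmetic that assumes the rounded 2.7 billion parameter count and ignores activations, the attention cache, and the overhead that real quantization schemes add (some layers typically stay in higher precision). The names are our own.

```python
N_PARAMS = 2.7e9  # rounded parameter count from the model description

# Approximate storage cost per parameter for each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the weights in GiB."""
    return n_params * bytes_per_param / 1024**3


for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>8}: ~{weight_gib(N_PARAMS, size):.1f} GiB")
```

Actual usage will be higher than these figures, so treat them as a lower bound when choosing a GPU.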

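📚 Detailed documentation

Model description

replit-code-v1-3b is a 2.7-billion-parameter causal language model focused on code completion. It was trained on the MosaicML platform using 256 A100-40GB GPUs.

The training mixture contains 20 different languages, listed here in descending order of token count: Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell

Intended use

Replit intends this model to be usable by anyone as a foundation model for application-specific fine-tuning, with no strict limitations on commercial use.

Limitations

Even after data-cleaning filters have been applied, the pre-training dataset may still contain offensive or inappropriate content, and such content may be reflected in text the model generates. Users should exercise reasonable caution when using the model in production systems, and should not use it for any application that may cause harm or distress to individuals or groups.

Post-processing

As with all code generation models, post-processing the generated code is important. The following steps are particularly recommended:

Stop generation when the EOS token is encountered.
Remove trailing whitespace.
Set max_tokens to a reasonable value for your completion use case.
Truncate the generation at stop words such as return, def, "```", and "\n\n\n", to avoid emitting incomplete code when max_tokens is larger than the length of the expected completion.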
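The truncation and whitespace steps above can be sketched as a small helper that runs on the decoded completion, meaning the text generated after the prompt. This is a minimal sketch: the function name `postprocess_completion` and the default stop list are our own choices. In particular, the default list leaves out `return`, since cutting there would remove the body of most functions. Which stop sequences make sense depends on the completion use case.

```python
def postprocess_completion(completion: str,
                           stop_sequences=("\ndef ", "```", "\n\n\n")) -> str:
    """Cut the completion at the earliest stop sequence and strip the tail.

    completion: decoded text generated after the prompt
    stop_sequences: strings that mark the end of the useful completion
    """
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    # Remove trailing whitespace left over after the cut.
    return completion[:cut].rstrip()


raw = "    return n\n\n\ndef unrelated():\n    pass"
print(repr(postprocess_completion(raw)))
```

Stopping at the EOS token and capping the length are handled at generation time, for example through the `eos_token_id` and `max_length` arguments already used in the generation example.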

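🔧 Technical details

Training data: the model was trained on a subset of the Stack Dedup v1.2 dataset. The training set contains 175 billion tokens and was repeated for 3 epochs, for a total of 525 billion tokens (about 195 tokens per parameter).
Training platform: trained on the MosaicML platform using 256 A100-40GB GPUs.
Techniques: Flash Attention for fast training and inference, ALiBi positional embeddings to support variable context lengths at inference time, and the LionW optimizer, among others.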
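The training-data figures above can be checked with a few lines of arithmetic. The sketch uses the rounded 2.7 billion parameter count, so the tokens-per-parameter result comes out slightly below the quoted value of about 195. The difference is consistent with rounding of the parameter count.

```python
DATASET_TOKENS = 175_000_000_000  # tokens in the training set
EPOCHS = 3                        # passes over the training set
N_PARAMS = 2_700_000_000          # rounded parameter count

total_tokens = DATASET_TOKENS * EPOCHS
tokens_per_param = total_tokens / N_PARAMS

print(f"total tokens seen: {total_tokens / 1e9:.0f} billion")
print(f"tokens per parameter: {tokens_per_param:.1f}")
```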