nlgp-docstring: an open-source Python code generation model that synthesizes code from natural language


Nlgp Docstring

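Developed by Nokia

A Python code generation model trained on Jupyter notebooks. Given a natural language intent, it synthesizes code that fits the surrounding code context.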

Large language model

Transformers

Supports multiple languages
License: Apache-2.0
Tags: #Python code generation #context-aware programming #trained on Jupyter notebooks

Downloads: 45

Released: 3/2/2022

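Model overview

The model generates a code snippet from a natural language description (the intent) within a given Python code context, which makes it well suited to programming assistance tasks.

Model features

Context-aware code generation

Understands the existing code context and generates snippets that fit it

Natural language understanding

Turns natural language descriptions into valid Python code

Special whitespace handling

Uses special tokens to encode code indentation so that generated code keeps correct formatting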
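As a minimal sketch of the whitespace handling, the snippet below expands indentation tokens of the form `<|Nspace|>` back into literal spaces, using the same regular expression as the `postprocess` function in the usage example of this card. The helper name `expand_indent_tokens` and the sample string are illustrative assumptions, not part of the model's API.

```python
import re

# Indentation tokens as they appear in this card's usage example, e.g. <|4space|>.
INDENT_TOKEN = re.compile(r"<\|(\d+)space\|>")

def expand_indent_tokens(text: str) -> str:
    """Replace each <|Nspace|> token with N literal spaces."""
    return INDENT_TOKEN.sub(lambda m: " " * int(m.group(1)), text)

# Hypothetical raw model output containing one indentation token.
generated = "for label in labels:\n<|4space|>print(label)"
print(expand_indent_tokens(generated))
```

Encoding indentation as single tokens keeps runs of spaces from being split into many tokens, at the cost of a small decoding step like this one.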

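Model capabilities

Python code completion

Code generation from natural language

Context-aware programming assistance

Use cases

Programming assistance

Data visualization code generation

Generates matplotlib visualization code from a natural language description

In the example, the model generates a plt.bar() bar chart call

Code completion

Completes the follow-on logic for existing code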

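🚀 NLGP docstring model

The NLGP docstring model was introduced in the paper Natural Language-Guided Programming. It was trained on a collection of Jupyter notebooks and can be used to synthesize Python code that implements a natural language intent in a given code context (see the example below). See also the NLGP natural model.

This work was carried out by a research group at Nokia Bell Labs.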

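🚀 Quick start

Given a code context and a natural language intent, the NLGP docstring model generates the corresponding Python code. The following simple example shows a context, an intent, and the code the model predicts.

Example

Context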

import matplotlib.pyplot as plt

values = [1, 2, 3, 4]
labels = ["a", "b", "c", "d"]

Intent

# plot a bar chart

Prediction

plt.bar(labels, values)
plt.show()
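The context and intent are passed to the model as one prompt string. The sketch below mirrors the format used by the `preprocess` function in the usage example of this card, where the intent is followed by an `<|endofcomment|>` marker. The helper name `build_prompt` is a hypothetical name chosen for illustration.

```python
def build_prompt(context: str, intent: str) -> str:
    """Join a code context and a comment-style intent into one prompt string."""
    # Format taken from this card's usage example: context, newline, intent,
    # then the <|endofcomment|> marker and a trailing newline.
    return f"{context}\n{intent} <|endofcomment|>\n"

context = (
    "import matplotlib.pyplot as plt\n"
    "\n"
    "values = [1, 2, 3, 4]\n"
    'labels = ["a", "b", "c", "d"]\n'
)
prompt = build_prompt(context, "# plot a bar chart")
print(prompt)
```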

💻 Usage examples

Basic usage

The following is a complete code example for using the NLGP docstring model:

import re
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# load the model
tok = GPT2TokenizerFast.from_pretrained("Nokia/nlgp-docstring")
model = GPT2LMHeadModel.from_pretrained("Nokia/nlgp-docstring") 

# preprocessing functions
num_spaces = [2, 4, 6, 8, 10, 12, 14, 16, 18]
def preprocess(context, query):
    """
    Encodes context + query as a single string and 
    replaces whitespace with special tokens <|2space|>, <|4space|>, ...
    """
    input_str = f"{context}\n{query} <|endofcomment|>\n"
    indentation_symbols = {n: f"<|{n}space|>" for n in num_spaces}
    m = re.match("^[ ]+", input_str)
    if not m:
        return input_str
    leading_whitespace = m.group(0)
    N = len(leading_whitespace)
    # num_spaces and indentation_symbols are plain variables here, not attributes
    for n in num_spaces:
        leading_whitespace = leading_whitespace.replace(n * " ", indentation_symbols[n])
    return leading_whitespace + input_str[N:]
    
detokenize_pattern = re.compile(r"<\|(\d+)space\|>")
def postprocess(output):
    output = output.split("<|cell|>")[0]
    def insert_space(m):
        num_spaces = int(m.group(1))
        return num_spaces * " "
    return detokenize_pattern.sub(insert_space, output)

# inference
code_context = """
import matplotlib.pyplot as plt

values = [1, 2, 3, 4]
labels = ["a", "b", "c", "d"]
"""
query = "# plot a bar chart"

input_str = preprocess(code_context, query)
input_ids = tok(input_str, return_tensors="pt").input_ids

max_length = 150 # don't generate more than this many new tokens
total_max_length = min(1024, input_ids.shape[-1] + max_length) # total = input + output, capped at 1024 tokens

input_and_output = model.generate(
    input_ids=input_ids, 
    max_length=total_max_length,
    min_length=10,
    do_sample=False,
    num_beams=4,
    early_stopping=True,
    eos_token_id=tok.encode("<|cell|>")[0]
)

output = input_and_output[:, input_ids.shape[-1]:] # remove the tokens that correspond to the input_str
output_str = tok.decode(output[0])
print(postprocess(output_str))
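The length arithmetic in the example can be isolated into a small helper. This is a sketch under one assumption: the 1024 used in the example is treated as an upper bound on the total sequence length (prompt plus generated tokens). The function name and parameter names are hypothetical.

```python
def total_generation_length(
    input_len: int,
    max_new_tokens: int = 150,
    context_window: int = 1024,  # assumed cap, taken from the example above
) -> int:
    """Return a value for `max_length`: prompt length plus new tokens, capped."""
    return min(context_window, input_len + max_new_tokens)

# A short prompt leaves room for all 150 new tokens.
print(total_generation_length(30))
# A long prompt is limited by the cap instead.
print(total_generation_length(1000))
```

Because `max_length` in `generate` counts the prompt tokens as well, the cap has to apply to the sum rather than to the number of new tokens alone.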