ernie-code-560m open-source model: connecting 116 natural languages with 6 programming languages for cross-lingual tasks


Ernie Code 560m

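Developed by Baidu

ERNIE-Code is a unified large language model that connects 116 natural languages with 6 programming languages and supports a range of cross-lingual tasks.

Large language model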

Transformers

License: MIT #multilingual-code-generation #cross-lingual-pretraining #zero-shot-translation

Downloads: 69

Released: 3/9/2024

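Model overview

ERNIE-Code is a large language model covering multiple natural languages and programming languages. It is pre-trained with span-corruption language modeling and pivot-based translation language modeling, and it is suited to tasks such as code-to-text and text-to-code generation.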

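Model features

Multilingual support

Supports 116 natural languages and 6 programming languages, covering a broad range of cross-lingual tasks.

Cross-lingual pre-training

Uses span-corruption language modeling and pivot-based translation language modeling to improve performance on multilingual tasks.

Zero-shot ability

Shows an advantage in zero-shot prompting on code summarization and text translation.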

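Model capabilities

Multilingual code-to-text generation

Multilingual text-to-code generation

Multilingual code-to-code generation

Multilingual text-to-text translation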

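Use cases

Code intelligence

Code summarization

Generates natural-language descriptions for code written in several programming languages.

Reported to perform well on multilingual code summarization.

Code translation

Translates code from one programming language into another.

Reported to outperform previous multilingual models on code-to-code generation.

Natural language processing

Text translation

Supports text translation between many natural languages.

Shows an advantage on zero-shot text translation.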
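
Both translation use cases above are driven by a plain-text prompt. The usage script further down this card uses prompts of the form "translate python to java: \n..." and "translate English to Japanese: \n...". The helper below is a minimal sketch that reproduces that template; the function name `build_translation_prompt` is hypothetical and is not part of the model's API, and prompt wording for other tasks is not covered here.

```python
def build_translation_prompt(source_lang: str, target_lang: str, content: str) -> str:
    """Build a prompt in the "translate X to Y" style used by the usage script.

    Hypothetical helper: it only formats a string and does not call the model.
    """
    if not source_lang or not target_lang:
        raise ValueError("both source_lang and target_lang are required")
    # The template keeps the space and newline after the colon, as in the script.
    return f"translate {source_lang} to {target_lang}: \n{content}"


# Code-to-code and text-to-text prompts share the same template.
code_prompt = build_translation_prompt("python", "java", "arr.sort()")
text_prompt = build_translation_prompt("English", "Japanese", "quick sort")
print(repr(code_prompt))
print(repr(text_prompt))
```

Prompts for code input still need the whitespace pre-processing applied by `format_code_with_spm_compatablity` in the usage script before tokenization.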

🚀 ERNIE-Code

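ERNIE-Code is a unified large language model (LLM) that connects 116 natural languages with 6 programming languages. It uses two pre-training methods for universal cross-lingual pre-training and performs well across a range of end tasks in code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation.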

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

[Figure: ernie-code-comp]

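ERNIE-Code uses two methods for universal cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual natural language (NL) or programming language (PL) data, and pivot-based translation language modeling, which relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence. It also shows an advantage in zero-shot prompting on multilingual code summarization and text-to-text translation.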
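
To make the two objectives more concrete, here is a minimal sketch of how training pairs could be built. It is not the authors' implementation. It assumes T5-style sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...), fixed span positions instead of random sampling, a `</s>` separator between the two sides of a parallel pair, and span corruption over the joined sequence as a stand-in for the translation objective. The function names are hypothetical.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel token.

    Returns (encoder_input, decoder_target). The sentinel naming is an assumption.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted prefix
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the decoder must recover the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel, T5 style
    return inputs, targets


def make_translation_lm_example(
    source: List[str], pivot: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Join a parallel pair (for example code and its English description) and
    corrupt spans across the joined sequence, so that a masked span on one
    side can be recovered from the other side. The separator is an assumption.
    """
    joined = source + ["</s>"] + pivot
    return span_corrupt(joined, spans)


# Monolingual example: corrupt one span inside a code snippet.
mono_inputs, mono_targets = span_corrupt(["arr", ".", "sort", "(", ")"], [(2, 3)])

# Parallel example: code paired with an English description.
pair_inputs, pair_targets = make_translation_lm_example(
    ["arr", ".", "sort", "(", ")"], ["sort", "the", "array"], [(2, 3), (6, 8)]
)
print(mono_inputs, mono_targets)
print(pair_inputs, pair_targets)
```

In the parallel example the masked `sort` in the code can be inferred from the English side, and the masked words of the description from the code, which is the intuition behind training on parallel data.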

ACL 2023 (Findings) | arXiv

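🚀 Quick start

ERNIE-Code is a unified large language model that connects many natural languages and programming languages. The example below shows how to run it on translation-style tasks.

💻 Usage examples

Basic usage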

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "baidu/ernie-code-560m"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: code input is pre-processed with `format_code_with_spm_compatablity` below;
# generated code is post-processed with `clean_up_code_spaces` (defined at the end of this script).


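def format_code_with_spm_compatablity(line: str):
    """Protect indentation: turn each space that directly follows a newline
    (and each further space in that run) into a "<|space|>" marker."""
    space_marker = "<|space|>"
    tokens = list(line)
    i = 0
    while i < len(tokens):
        if tokens[i] == "\n":
            # Walk over the run of spaces that starts the next line.
            while i + 1 < len(tokens) and tokens[i + 1] == " ":
                tokens[i + 1] = space_marker
                i += 1
        i += 1
    return "".join(tokens)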

# Alternative example (code input), kept disabled inside a string literal:
"""
TYPE = "code"  # input type, one of ("code", "text")
source = "arr.sort()"
prompt = "translate python to java: \n%s" % source  # your prompt here
"""

TYPE = "text"  # input type, one of ("code", "text")
source = "quick sort"  # named `source` to avoid shadowing the built-in `input`
prompt = "translate English to Japanese: \n%s" % source  # your prompt here

assert TYPE in ("code", "text")

# preprocess for code input
if TYPE=="code":
    prompt = format_code_with_spm_compatablity(prompt)

model_inputs = tokenizer(prompt, max_length=512, padding=False, truncation=True, return_tensors="pt")

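# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
input_ids = model_inputs.input_ids.to(device)
attention_mask = model_inputs.attention_mask.to(device)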

output = model.generate(input_ids=input_ids, attention_mask=attention_mask, 
        num_beams=5, max_length=20) # change to your needs

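# Decode to a list of strings (one per returned sequence) so that the
# `clean_up_code_spaces` list comprehension at the end of the script receives whole predictions.
# Customize the post-processing in `clean_up_code_spaces` to your needs.
output = tokenizer.batch_decode(output, skip_special_tokens=True)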


# post-process the code generation
def clean_up_code_spaces(s: str):
    # post process
    # ===========================
    new_tokens = ["<pad>", "</s>", "<unk>", "\n", "\t", "<|space|>"*4, "<|space|>"*2, "<|space|>"]
    for tok in new_tokens:
        s = s.replace(f"{tok} ", tok)

    cleaned_tokens = ["<pad>", "</s>", "<unk>"]
    for tok in cleaned_tokens:
        s = s.replace(tok, "")
    s = s.replace("<|space|>", " ")
    return s
output = [clean_up_code_spaces(pred) for pred in output]
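
The whitespace handling in the script above can be checked without loading the model. The sketch below re-declares the two helpers so that it is self-contained, then runs a snippet through the pre-processing step and back. The `decoded` string is a made-up stand-in for tokenizer output with stray spaces after the marker tokens; real model output may look different.

```python
def format_code_with_spm_compatablity(line: str) -> str:
    """Replace each space in the run that follows a newline with "<|space|>"."""
    tokens = list(line)
    i = 0
    while i < len(tokens):
        if tokens[i] == "\n":
            while i + 1 < len(tokens) and tokens[i + 1] == " ":
                tokens[i + 1] = "<|space|>"
                i += 1
        i += 1
    return "".join(tokens)


def clean_up_code_spaces(s: str) -> str:
    """Drop spaces after marker tokens, strip special tokens, restore spaces."""
    new_tokens = ["<pad>", "</s>", "<unk>", "\n", "\t",
                  "<|space|>" * 4, "<|space|>" * 2, "<|space|>"]
    for tok in new_tokens:
        s = s.replace(f"{tok} ", tok)
    for tok in ["<pad>", "</s>", "<unk>"]:
        s = s.replace(tok, "")
    return s.replace("<|space|>", " ")


snippet = "def f(x):\n    return x\n"

# Pre-processing: indentation becomes explicit markers, other spaces are untouched.
formatted = format_code_with_spm_compatablity(snippet)
print(repr(formatted))

# Post-processing restores the original text from the marked-up form.
restored = clean_up_code_spaces(formatted)
print(repr(restored))

# Hypothetical decoded output with stray spaces and a trailing special token.
decoded = "def f(x):\n <|space|><|space|><|space|><|space|> return x</s>"
cleaned = clean_up_code_spaces(decoded)
print(repr(cleaned))
```

Only spaces at the start of a line are marked, so a snippet whose first line is indented keeps those leading spaces as ordinary spaces.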

You can refer to the seq2seq translation code for fine-tuning.

You can also look at the official inference code in PaddleNLP.

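Zero-shot examples

Multilingual code-to-text generation (zero-shot)

[Figure: code-to-text-examples]

[Figure: zh_code-to-text_examples-1]

Multilingual text-to-text translation (zero-shot)

[Figure: zero-shot-mt-examples]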

📚 Documentation

BibTeX citation

@inproceedings{chai-etal-2023-ernie,
    title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages",
    author = "Chai, Yekun  and
      Wang, Shuohuan  and
      Pang, Chao  and
      Sun, Yu  and
      Tian, Hao  and
      Wu, Hua",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.676",
    pages = "10628--10650",
    abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.",
}