Paper Title

CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences

Paper Authors

Maliheh Izadi, Roberta Gismondi, Georgios Gousios

Paper Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context. In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
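
To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of a multi-task model that consumes the two parallel sequences the abstract describes (source token names and their equivalent AST token types) and predicts both jointly. It is not the authors' implementation: the class name, hyperparameters, and the choice to fuse the two views by embedding addition are illustrative assumptions only.

```python
# Illustrative sketch (NOT the paper's implementation): a multi-task model that
# embeds two parallel sequences -- token names and AST token types -- runs them
# through a shared causally-masked Transformer, and trains two prediction heads
# jointly. All sizes and design choices here are assumptions for illustration.
import torch
import torch.nn as nn

class JointCompletionModel(nn.Module):
    def __init__(self, name_vocab: int, type_vocab: int,
                 d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Separate embeddings for token names and AST token types.
        self.name_emb = nn.Embedding(name_vocab, d_model)
        self.type_emb = nn.Embedding(type_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Two task-specific heads (multi-task learning):
        # next token name and next AST token type.
        self.name_head = nn.Linear(d_model, name_vocab)
        self.type_head = nn.Linear(d_model, type_vocab)

    def forward(self, names: torch.Tensor, types: torch.Tensor):
        # Fuse the naming and structural views (addition is an assumption).
        x = self.name_emb(names) + self.type_emb(types)
        # Causal mask so each position only attends to earlier tokens,
        # as required for left-to-right completion.
        mask = nn.Transformer.generate_square_subsequent_mask(names.size(1))
        h = self.encoder(x, mask=mask)
        return self.name_head(h), self.type_head(h)

# Toy training step: a joint loss over both tasks.
model = JointCompletionModel(name_vocab=10_000, type_vocab=100)
names = torch.randint(0, 10_000, (2, 16))   # batch of token-name ids
types = torch.randint(0, 100, (2, 16))      # matching AST token-type ids
name_logits, type_logits = model(names, types)
targets_n = torch.randint(0, 10_000, (2, 16))
targets_t = torch.randint(0, 100, (2, 16))
loss = (nn.functional.cross_entropy(name_logits.transpose(1, 2), targets_n)
        + nn.functional.cross_entropy(type_logits.transpose(1, 2), targets_t))
loss.backward()
```

The two heads share one backbone, so gradients from the type-prediction task regularize the name-prediction task and vice versa; this is the general multi-task intuition behind learning from structure and naming together, though CodeFill's actual parallel architecture may differ in its details.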
