Paper Title
Probing Pretrained Models of Source Code
Paper Authors
Paper Abstract
Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task. However, general pretrained models such as CodeBERT or CodeT5 have recently been shown to outperform task-specific models in many applications. While pretrained models are known to learn complex patterns from data, they may fail to understand some properties of source code. To test diverse aspects of code understanding, we introduce a set of diagnostic probing tasks. We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and namespaces, and natural language naming. We also investigate how probing results are affected by using code-specific pretraining objectives, varying the model size, or finetuning.
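Probing in this sense typically means training a lightweight classifier on frozen hidden representations of a pretrained model: if a simple probe can predict a property (e.g., syntactic correctness) from the representations, the model plausibly encodes that property. The following is a minimal illustrative sketch in NumPy; the synthetic "hidden states" and the linearly encoded label are stand-ins (assumptions, not the paper's actual data or models), and the probe is a plain logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen hidden states of a pretrained code model
# (in practice these would be layer activations from e.g. CodeBERT).
n, d = 1000, 64
X = rng.normal(size=(n, d))

# Hypothetical probed property, here linearly encoded in the representations.
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)

# Train/test split
X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

# Linear probe: logistic regression via batch gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X_tr @ w + b)))   # sigmoid predictions
    w -= lr * (X_tr.T @ (p - y_tr)) / len(y_tr)  # gradient step on weights
    b -= lr * np.mean(p - y_tr)                  # gradient step on bias

# Probe accuracy well above the 0.5 chance level suggests the property
# is (linearly) recoverable from the representations.
acc = np.mean(((X_te @ w + b) > 0) == (y_te > 0.5))
print(f"probe accuracy: {acc:.2f}")
```

The key design point is that the probe stays deliberately simple (here linear), so that high accuracy reflects information already present in the frozen representations rather than capacity added by the probe itself.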