CoCoSoDa: Effective Contrastive Learning for Code Search

Paper title

CoCoSoDa: Effective Contrastive Learning for Code Search

Authors

Ensheng Shi, Yanlin Wang, Wenchao Gu, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun

Abstract

Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still a lot of room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors in contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks some tokens in the input sequences or replaces them with their types to generate positive samples. A momentum mechanism maintains a queue and a momentum encoder to generate large and consistent representations of negative samples in a mini-batch. In addition, multimodal contrastive learning pulls together the representations of paired code snippets and queries and pushes apart those of unpaired ones. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 14 baselines and, in particular, exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) We adapt our techniques to several different pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT and observe a significant boost in their performance in code search. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind the good performance of our model.
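
The three ingredients named in the abstract (soft data augmentation, a momentum-updated encoder with a queue of negatives, and a contrastive objective) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the InfoNCE-style loss, the MoCo-style update rule, the `[MASK]` token, the function names, and the values of `p`, `m`, and `temperature` are all assumptions chosen for illustration.

```python
import math
import random
from collections import deque

MASK = "[MASK]"  # hypothetical mask token; the abstract does not name one


def soft_augment(tokens, types, p=0.15, rng=random):
    """With probability p (assumed value), replace a token by MASK or by
    its type label, chosen with equal chance. Called anew on each pass,
    so the augmentation is dynamic rather than fixed."""
    out = []
    for tok, typ in zip(tokens, types):
        if rng.random() < p:
            out.append(MASK if rng.random() < 0.5 else typ)
        else:
            out.append(tok)
    return out


def momentum_update(query_params, key_params, m=0.999):
    """Assumed MoCo-style rule: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Assumed InfoNCE-style loss: -log softmax score of the positive pair
    against the positive plus all negatives (computed stably)."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    mx = max(logits)
    log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_denom - logits[0]


# Toy usage: a bounded queue holds representations from earlier batches
# and supplies negatives beyond the current mini-batch.
queue = deque(maxlen=4)
queue.extend([[0.0, 1.0], [-1.0, 0.0]])

query_vec = [1.0, 0.0]   # stand-in for an encoded query
code_vec = [0.9, 0.1]    # stand-in for the paired code, from the key encoder
loss = info_nce(query_vec, code_vec, list(queue))

queue.append(code_vec)   # enqueue; the oldest entry drops once the queue is full
augmented = soft_augment(["def", "add", "(", "a", ")"],
                         ["keyword", "identifier", "punct", "identifier", "punct"],
                         rng=random.Random(0))
```

In this sketch the queue decouples the number of negatives from the mini-batch size, while the slowly moving key encoder keeps queued representations comparable across steps, which is one plausible reading of "large and consistent" in the abstract.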
