Paper Title

Featherweight Assisted Vulnerability Discovery

Paper Authors

David Binkley, Leon Moonen, Sibren Isaacman

Paper Abstract

Predicting vulnerable source code helps to focus attention on those parts of the code that need to be examined with more scrutiny. Recent work proposed the use of function names as semantic cues that can be learned by a deep neural network (DNN) to aid in the hunt for vulnerable functions. Combining identifier splitting, which splits each function name into its constituent words, with a novel frequency-based algorithm, we explore the extent to which the words that make up a function's name can predict potentially vulnerable functions. In contrast to *lightweight* predictions by a DNN that considers only function names, avoiding the use of a DNN provides *featherweight* predictions. The underlying idea is that function names that contain certain "dangerous" words are more likely to accompany vulnerable functions. Of course, this assumes that the frequency-based algorithm can be properly tuned to focus on truly dangerous words. Because it is more transparent than a DNN, the frequency-based algorithm enables us to investigate the inner workings of the DNN. If successful, this investigation into what the DNN does and does not learn will help us train more effective future models. We empirically evaluate our approach on a heterogeneous dataset containing over 73,000 functions labeled vulnerable and over 950,000 functions labeled benign. Our analysis shows that words alone account for a significant portion of the DNN's classification ability. We also find that words are of greatest value in the datasets with a more homogeneous vocabulary. Thus, when working within the scope of a given project, where the vocabulary is unavoidably homogeneous, our approach provides a cheaper, potentially complementary, technique to aid in the hunt for source-code vulnerabilities. Finally, this approach has the advantage that it is viable with orders of magnitude less training data.
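
The abstract outlines the core approach: split each function name into its constituent words and use a frequency-based scoring of those words to flag functions whose names contain "dangerous" words. The paper's actual algorithm is not reproduced here, so the snippet below is only a minimal illustrative sketch of that general idea, assuming a simple relative-frequency scoring rule, an arbitrary threshold, and made-up function names; it is not the authors' algorithm.

```python
import re
from collections import Counter

def split_identifier(name):
    """Split a function name into lower-case words (handles snake_case and camelCase)."""
    words = []
    for part in re.split(r'[_\W]+', name):
        # Break camelCase / PascalCase runs into separate words.
        words += re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', part)
    return [w.lower() for w in words if w]

def dangerous_words(vulnerable_names, benign_names, threshold=0.9, min_count=1):
    """Collect words whose occurrences come predominantly from vulnerable function names.

    The scoring rule (share of occurrences in the vulnerable class) and the
    threshold are illustrative assumptions, not the paper's tuned algorithm.
    """
    vuln = Counter(w for n in vulnerable_names for w in split_identifier(n))
    benign = Counter(w for n in benign_names for w in split_identifier(n))
    dangerous = set()
    for word, v_count in vuln.items():
        share = v_count / (v_count + benign[word])  # fraction of occurrences in vulnerable names
        if v_count >= min_count and share >= threshold:
            dangerous.add(word)
    return dangerous

def predict_vulnerable(name, dangerous):
    """Flag a function as potentially vulnerable if its name contains any dangerous word."""
    return any(w in dangerous for w in split_identifier(name))

# Toy usage with hypothetical function names, for illustration only.
vuln_names = ["copy_user_buffer", "parseUntrustedInput", "strcpy_wrapper"]
benign_names = ["get_version", "init_logger", "format_timestamp"]
danger = dangerous_words(vuln_names, benign_names)
print(predict_vulnerable("copyInputBuffer", danger))  # True: "copy" and "buffer" score as dangerous here
```

In this toy setting, a word counts as "dangerous" when nearly all of its occurrences come from names of vulnerable functions; a function is then flagged if its split name contains any such word. Tuning what counts as dangerous is exactly the point the abstract hedges on.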
