On the Limitations of Continual Learning for Malware Classification

Paper title

On the Limitations of Continual Learning for Malware Classification

Authors

Mohammad Saidur Rahman, Scott E. Coull, Matthew Wright

Abstract

Malicious software (malware) classification offers a unique challenge for continual learning (CL) regimes due to the volume of new samples received on a daily basis and the evolution of malware to exploit new vulnerabilities. On a typical day, antivirus vendors receive hundreds of thousands of unique pieces of software, both malicious and benign, and over the course of the lifetime of a malware classifier, more than a billion samples can easily accumulate. Given the scale of the problem, sequential training using continual learning techniques could provide substantial benefits in reducing training and storage overhead. To date, however, there has been no exploration of CL applied to malware classification tasks. In this paper, we study 11 CL techniques applied to three malware tasks covering common incremental learning scenarios, including task, class, and domain incremental learning (IL). Specifically, using two realistic, large-scale malware datasets, we evaluate the performance of the CL methods on both binary malware classification (Domain-IL) and multi-class malware family classification (Task-IL and Class-IL) tasks. To our surprise, continual learning methods significantly underperformed naive Joint replay of the training data in nearly all settings -- in some cases reducing accuracy by more than 70 percentage points. A simple approach of selectively replaying 20% of the stored data achieves better performance, with 50% of the training time compared to Joint replay. Finally, we discuss potential reasons for the unexpectedly poor performance of the CL techniques, with the hope that it spurs further research on developing techniques that are more effective in the malware classification domain.
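The partial-replay idea mentioned in the abstract (retraining on new samples plus 20% of the stored data) can be illustrated with a minimal sketch. The abstract does not say how the replayed samples are selected, so uniform random sampling is an assumption here, and the function name `build_replay_training_set`, its parameters, and the toy data are hypothetical rather than taken from the paper.

```python
import random


def build_replay_training_set(stored_tasks, new_task, replay_fraction=0.2, seed=0):
    """Return the new samples plus a random fraction of previously stored samples.

    Assumption: samples are chosen uniformly at random; the paper's actual
    selection strategy is not described in the abstract.
    """
    rng = random.Random(seed)
    # Flatten all previously stored tasks into a single pool.
    stored = [sample for task in stored_tasks for sample in task]
    # Number of old samples to replay alongside the new task.
    k = round(len(stored) * replay_fraction)
    replayed = rng.sample(stored, k)
    return replayed + list(new_task)


# Toy data: two earlier tasks of 50 samples each, one new task of 30 samples.
previous = [
    [("old_a", i) for i in range(50)],
    [("old_b", i) for i in range(50)],
]
current = [("new", i) for i in range(30)]

train_set = build_replay_training_set(previous, current)
replayed_count = sum(1 for label, _ in train_set if label != "new")
print(len(train_set), replayed_count)
```

With these toy sizes, the training set holds 30 new samples and 20 replayed ones instead of all 100 stored samples, which is where the reduction in training and storage cost would come from.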
