Paper title
Multifamily Malware Models
Paper authors
Paper abstract
When training a machine learning model, there is likely to be a tradeoff between accuracy and the diversity of the dataset. Previous research has shown that a model trained to detect one specific malware family generally yields stronger results than a single model trained on multiple diverse families. However, during the detection phase, it would be more efficient to have a single model that can reliably detect multiple families, rather than having to score each sample against multiple models. In this research, we conduct experiments based on byte $n$-gram features to quantify the relationship between the generality of the training dataset and the accuracy of the corresponding machine learning models, all within the context of the malware detection problem. We find that neighborhood-based algorithms generalize surprisingly well, far outperforming the other machine learning techniques considered.
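To make the two ingredients named in the abstract concrete, the following is a minimal sketch of byte $n$-gram feature extraction combined with a neighborhood-based (k-nearest-neighbor) classifier. It is not the authors' implementation: the function names, the choice of $n = 2$ and $k = 3$, the cosine similarity measure, and the toy byte strings standing in for malware and benign samples are all assumptions made for illustration.

```python
from collections import Counter
import math


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(sample: bytes, training, k: int = 3, n: int = 2) -> str:
    """Label a sample by majority vote among its k most similar training samples."""
    features = byte_ngrams(sample, n)
    scored = sorted(
        ((cosine(features, byte_ngrams(data, n)), label) for data, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the "malware" entries are arbitrary byte patterns,
# not real samples, and the labels are assumptions for this sketch.
training = [
    (b"\x90\x90\x90\xeb\xfe\x90\x90", "malware"),
    (b"\x90\x90\xeb\xfe\x90\x90\x90", "malware"),
    (b"\xeb\xfe\x90\x90\x90\x90", "malware"),
    (b"hello world, hello", "benign"),
    (b"world hello world", "benign"),
    (b"hello hello world", "benign"),
]

print(knn_predict(b"\x90\x90\x90\x90\xeb\xfe", training))
print(knn_predict(b"hello world", training))
```

In a multifamily setting, the "malware" label would cover samples drawn from several families at once. A nearest-neighbor vote only needs a few similar training samples near the query, which is one plausible intuition for why such methods might tolerate a more diverse training set.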