Paper Title

A Novel Resampling Technique for Imbalanced Dataset Optimization

Authors

Ivan Letteri, Antonio Di Cecco, Abeer Dyoub, Giuseppe Della Penna

Abstract

Despite the enormous amount of data available, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis, and network intrusion detection. Many studies have applied machine learning approaches to malware detection on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating its representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it also exposes the RRw-Optimized MTA-KDD'19 to a potential class imbalance problem. We capture the difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare, and outliers. In this work, we develop two versions of the Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithm for dealing with the class imbalance problem. The first module of the G1Nos algorithms performs silhouette-coefficient-based instance selection, identifying the critical threshold of the Imbalance Degree (ID); the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The G1Nos algorithms balance the classes to re-establish the proportions between the two classes of the dataset used. The experimental results show that our oversampling algorithms perform better than two other SOTA methodologies on all the metrics considered.
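The two-module idea described in the abstract (silhouette-based instance selection followed by SMOTE-like generation with a single nearest neighbour) can be illustrated with a minimal NumPy sketch. This is not the authors' G1Nos implementation: the function names `silhouette_per_sample` and `oversample_minority`, the fixed silhouette `threshold`, and the toy Gaussian data are all assumptions made for illustration, and the paper's Imbalance Degree criterion is not reproduced here.

```python
import numpy as np


def silhouette_per_sample(X, y):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for every sample."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    labels = np.unique(y)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = (y == y[i])
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # a class with a single sample gets silhouette 0
        a = dist[i, same].mean()  # mean distance to own class
        b = min(dist[i, y == c].mean() for c in labels if c != y[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores


def oversample_minority(X, y, minority_label, threshold=0.0, seed=0):
    """Hypothetical two-step oversampler (illustrative, not G1Nos itself).

    Step 1: keep minority samples whose silhouette is >= threshold.
    Step 2: interpolate between each kept sample and its single nearest
            kept minority neighbour until both classes have equal size.
    """
    rng = np.random.default_rng(seed)
    scores = silhouette_per_sample(X, y)
    is_min = (y == minority_label)
    seeds = X[is_min & (scores >= threshold)]
    n_needed = int((~is_min).sum() - is_min.sum())
    if n_needed <= 0 or len(seeds) < 2:
        return X, y  # nothing to generate, or too few seeds to interpolate

    # 1-nearest neighbour among the selected minority samples
    d = np.linalg.norm(seeds[:, None, :] - seeds[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)

    idx = rng.integers(0, len(seeds), size=n_needed)
    gap = rng.random((n_needed, 1))  # interpolation factor in [0, 1)
    synthetic = seeds[idx] + gap * (seeds[nn[idx]] - seeds[idx])

    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new


# Toy data (assumed): 40 majority points and 8 well-separated minority points.
data_rng = np.random.default_rng(42)
X_maj = data_rng.normal(0.0, 1.0, size=(40, 2))
X_min = data_rng.normal(6.0, 0.5, size=(8, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 40 + [1] * 8)

X_bal, y_bal = oversample_minority(X, y, minority_label=1)
print(X_bal.shape, np.bincount(y_bal))
```

Filtering seeds by silhouette before interpolating is meant to keep synthetic points away from samples that sit closer to the other class, which loosely corresponds to the borderline, rare, and outlier categories mentioned above. How the threshold is chosen in the paper is not stated in the abstract, so the default of 0.0 here is purely a placeholder.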
