Paper title
robROSE: A robust approach for dealing with imbalanced data in fraud detection
Paper authors
Paper abstract
A major challenge in fraud detection is that fraudulent activities form a minority class that makes up a very small proportion of the data set. In most data sets, fraud typically occurs in less than 0.5% of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. We discuss some popular oversampling techniques that address the problem of imbalanced data by creating synthetic samples that mimic the minority class. A frequent problem when analyzing real data is the presence of anomalies or outliers. When such atypical observations are present in the data, most oversampling techniques are prone to creating synthetic samples that distort the detection algorithm and spoil the resulting analysis. A useful tool for anomaly detection is robust statistics, which aims to find the outliers by first fitting the majority of the data and then flagging observations that deviate from that fit. In this paper, we present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with the problem of imbalanced data and the presence of outliers. The proposed method succeeds in enhancing the presence of the fraud cases while ignoring anomalies. The good performance of our new sampling technique is illustrated on simulated and real data sets, and it is shown that robROSE can provide better insight into the structure of the data. The source code of the robROSE algorithm is made freely available.
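The two ideas the abstract combines, flagging atypical minority observations with robust statistics and then oversampling only from the remaining ones, can be illustrated with a small sketch. This is not the authors' robROSE implementation. The function names `robust_inliers` and `smoothed_bootstrap` are hypothetical, the per-feature median/MAD rule is a simple stand-in for whatever robust estimator the paper uses, and the rule-of-thumb bandwidth and the cutoff of 3.5 are assumptions made for illustration.

```python
import math
import random
import statistics


def robust_inliers(rows, cutoff=3.5):
    """Keep rows whose robust z-score is below `cutoff` in every feature.

    Centre and scale are the per-feature median and MAD (an assumed,
    simplified rule), so a few extreme rows cannot shift them much.
    """
    columns = list(zip(*rows))
    medians = [statistics.median(col) for col in columns]
    # 1.4826 makes the MAD comparable to a standard deviation
    # for normally distributed data.
    mads = [
        1.4826 * statistics.median([abs(v - m) for v in col])
        for col, m in zip(columns, medians)
    ]
    kept = []
    for row in rows:
        scores = []
        for v, m, s in zip(row, medians, mads):
            if s > 0:
                scores.append(abs(v - m) / s)
            else:
                # Degenerate feature: any deviation counts as extreme.
                scores.append(0.0 if v == m else math.inf)
        if max(scores) < cutoff:
            kept.append(row)
    return kept


def smoothed_bootstrap(rows, n_new, rng):
    """Draw `n_new` synthetic rows around randomly chosen seed rows.

    Each synthetic row is a seed row plus independent Gaussian noise,
    i.e. a draw from a Gaussian kernel density estimate with a
    diagonal bandwidth. Needs at least two rows.
    """
    n, d = len(rows), len(rows[0])
    columns = list(zip(*rows))
    # Rule-of-thumb bandwidth factor (an assumption of this sketch).
    h = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    scales = [h * statistics.stdev(col) for col in columns]
    synthetic = []
    for _ in range(n_new):
        seed = rng.choice(rows)
        synthetic.append(tuple(rng.gauss(v, s) for v, s in zip(seed, scales)))
    return synthetic


# Toy minority class: 30 typical fraud cases plus one atypical observation.
rng = random.Random(0)
fraud = [(rng.gauss(5.0, 1.0), rng.gauss(5.0, 1.0)) for _ in range(30)]
fraud.append((60.0, -40.0))

clean = robust_inliers(fraud)
synthetic = smoothed_bootstrap(clean, n_new=200, rng=rng)
```

Because the atypical row is dropped before resampling, no synthetic case is generated around it. Oversampling `fraud` directly would instead seed some synthetic rows at the outlier and inflate the noise scale, which is the distortion the abstract warns about.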