Paper title

SeismoFlow -- Data augmentation for the class imbalance problem

Authors

Ruy Luiz Milidiú, Luis Felipe Müller

Abstract

In several application areas, such as medical diagnosis, spam filtering, fraud detection, and seismic data analysis, it is common to find relevant classification tasks in which some classes occur only rarely. This is the so-called class imbalance problem, which is a challenge in machine learning. In this work, we propose SeismoFlow, a flow-based generative model that creates synthetic samples in order to address class imbalance. Inspired by the Glow model, it uses interpolation in the learned latent space to produce synthetic samples for one rare class. We apply our approach to the development of a seismogram signal quality classifier. We introduce a dataset composed of 5,223 seismograms distributed among the good, medium, and bad classes, with respective frequencies of 66.68%, 31.54%, and 1.76%. Our methodology is evaluated in a stratified 10-fold cross-validation setting, using the Miniception model as a baseline and assessing the effect of adding the generated samples to the training set of each iteration. In our experiments, we achieve an improvement of 13.9% in the rare-class F1-score without hurting the metric values for the other classes, and thus also observe an improvement in overall accuracy. Our empirical findings indicate that our method can generate high-quality synthetic seismograms with a realistic appearance and sufficient diversity to help the Miniception model overcome the class imbalance problem. We believe that our results are a step forward in solving both the task of seismogram signal quality classification and the class imbalance problem.
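
The latent-space interpolation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `encode` and `decode` functions below are hypothetical stand-ins (a fixed invertible affine map) for a trained flow, and `synthesize`, `SCALE`, and `SHIFT` are names introduced only for this example. A real Glow-style model would replace the affine map with learned invertible layers.

```python
import random

# Hypothetical stand-in for a trained flow: a fixed, invertible affine map.
# A real flow-based model would use learned invertible layers instead.
SCALE = 2.0
SHIFT = 0.5


def encode(x):
    """Map a signal (list of floats) to its latent code z = f(x)."""
    return [SCALE * v + SHIFT for v in x]


def decode(z):
    """Invert encode: x = f^{-1}(z)."""
    return [(v - SHIFT) / SCALE for v in z]


def interpolate_latents(z_a, z_b, alpha):
    """Linearly interpolate between two latent codes, alpha in [0, 1]."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def synthesize(rare_samples, n_new, rng):
    """Build n_new synthetic samples from random pairs of rare-class samples.

    Each pair is encoded, mixed in latent space with a random weight,
    and decoded back to signal space.
    """
    synthetic = []
    for _ in range(n_new):
        x_a, x_b = rng.sample(rare_samples, 2)
        alpha = rng.random()
        z_mix = interpolate_latents(encode(x_a), encode(x_b), alpha)
        synthetic.append(decode(z_mix))
    return synthetic


# Toy "rare class" signals (made-up values, for illustration only).
rng = random.Random(0)
rare = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
new_samples = synthesize(rare, n_new=4, rng=rng)
```

In an evaluation like the one described, such synthetic samples would be added only to the training portion of each cross-validation fold, so that the held-out fold still contains real data only.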
