Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems

Paper title

Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems

Authors

Abiri, Najmeh, Linse, Björn, Edén, Patrik, Ohlsson, Mattias

Abstract

Dealing with missing data in data analysis is inevitable. Although powerful imputation methods that address this problem exist, there is still much room for improvement. In this study, we examined single imputation based on deep autoencoders, motivated by the apparent success of deep learning in efficiently extracting useful dataset features. We have developed a consistent framework for both training and imputation. Moreover, we benchmarked the results against state-of-the-art imputation methods on different data sizes and characteristics. The work was not limited to datasets with a single variable type; we also imputed missing data with multi-type variables, e.g., a combination of binary, categorical, and continuous attributes. To evaluate the imputation methods, we randomly corrupted the complete data, with varying degrees of corruption, and then compared the imputed and original values. In all experiments, the developed autoencoder obtained the smallest error for all ranges of initial data corruption.
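
The evaluation protocol described in the abstract (corrupt complete data at several rates, impute, then compare imputed and original values) can be sketched roughly as follows. This is only an illustration under stated assumptions: entries are removed completely at random, column-mean imputation stands in for the paper's autoencoder, and RMSE over the removed entries is an assumed error measure. The function names (corrupt_mcar, mean_impute, masked_rmse, evaluate) are hypothetical and do not come from the paper.

```python
import math
import random


def corrupt_mcar(rows, rate, rng):
    """Copy `rows`, replacing each entry with None with probability `rate`.

    Assumption: values go missing completely at random (MCAR).
    """
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Stand-in imputer (NOT the paper's autoencoder): fill each column's
    gaps with the mean of that column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def masked_rmse(original, corrupted, imputed):
    """RMSE computed only over the entries that were removed."""
    squared_errors = [
        (o - i) ** 2
        for o_row, c_row, i_row in zip(original, corrupted, imputed)
        for o, c, i in zip(o_row, c_row, i_row)
        if c is None
    ]
    if not squared_errors:
        return 0.0  # nothing was removed, so there is nothing to score
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def evaluate(rows, rates, impute=mean_impute, seed=0):
    """Corrupt the complete data at each rate, impute, and score the result."""
    rng = random.Random(seed)
    results = {}
    for rate in rates:
        corrupted = corrupt_mcar(rows, rate, rng)
        results[rate] = masked_rmse(rows, corrupted, impute(corrupted))
    return results


if __name__ == "__main__":
    data_rng = random.Random(42)
    complete = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    for rate, error in evaluate(complete, [0.1, 0.3, 0.5]).items():
        print(f"corruption rate {rate:.1f}: RMSE {error:.3f}")
```

Any other imputer, including a trained denoising autoencoder, could be passed through the impute argument, so that all methods are scored on the same removed entries.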
