Paper title
Transforming variables to central normality
Paper authors
Paper abstract
Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. To handle such data, it is customary to preprocess the variables to make them more normal. The Box-Cox and Yeo-Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so that the transformed data can be approximately normal in the center while a few outliers may deviate from it. The proposed method compares favorably with existing techniques in an extensive simulation study and on real data.
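
For orientation, the classical Box-Cox and Yeo-Johnson transformations and the standard (non-robust) maximum likelihood fit mentioned in the abstract can be sketched as follows. This is a minimal illustration of the textbook definitions, not the modified transformations or the robust estimator the paper proposes. The function names, the grid-search range of [-2, 2], and the synthetic log-normal-like sample are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def box_cox(x, lam):
    """Classical Box-Cox transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Classical Yeo-Johnson transform, defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((1.0 + x) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)


def box_cox_profile_loglik(data, lam):
    """Gaussian profile log-likelihood of lam, up to an additive constant."""
    n = len(data)
    y = [box_cox(x, lam) for x in data]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n  # ML variance estimate (divides by n)
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in data)


def box_cox_mle(data, grid=None):
    """Grid-search maximizer of the profile log-likelihood (not outlier-robust)."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]  # assumed search range [-2, 2]
    return max(grid, key=lambda lam: box_cox_profile_loglik(data, lam))


# Illustrative data (an assumption): exponentiated normal quantiles,
# so a log transform (lam near 0) should make them look normal.
sample = [math.exp(NormalDist().inv_cdf((i + 0.5) / 200)) for i in range(200)]
lam_hat = box_cox_mle(sample)
```

Appending one or two extreme values to `sample` and calling `box_cox_mle` again is a simple way to explore how strongly this classical fit reacts to outliers, which is the weakness the abstract describes.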