Paper Title

Gaussian and Non-Gaussian Universality of Data Augmentation

Paper Authors

Kevin Han Huang, Peter Orbanz, Morgane Austern

Paper Abstract

We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: Data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties that have been attributed to data augmentation are neither simply true nor false, but rather depend on a combination of factors -- notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. As our main theoretical tool, we develop an adaptation of Lindeberg's technique for block dependence. The resulting universality regime may be Gaussian or non-Gaussian.
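
To make the setting concrete, here is a minimal sketch of the kind of augmented estimator the abstract refers to, written in illustrative notation of our own (the symbols n, m, \ell, g_j and the specific form are assumptions, not the paper's definitions): given n samples x_1, ..., x_n and m augmentations g_1(x_i), ..., g_m(x_i) of each sample, the loss \ell is averaged over all augmented copies.

\[
  \widehat{R}_{n,m}(\theta) \;=\; \frac{1}{n m} \sum_{i=1}^{n} \sum_{j=1}^{m} \ell\bigl(\theta;\, g_j(x_i)\bigr)
\]

Under this reading, the m augmented copies of a given x_i are mutually dependent while different samples remain independent, so the nm summands split into n independent blocks of size m. This block structure is the sort of dependence an adaptation of Lindeberg's technique for block dependence would need to handle, and n, m, and the data dimension correspond to the sample size, number of augmentations, and dimension whose interplay the abstract highlights.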
