Paper Title
AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization
Paper Authors
Abstract
With the increasing practicality of deep learning applications, practitioners are inevitably faced with datasets corrupted by noise from various sources such as measurement errors, mislabeling, and estimated surrogate inputs/outputs that can adversely impact the optimization results. It is a common practice to improve the optimization algorithm's robustness to noise, since this algorithm is ultimately in charge of updating the network parameters. Previous studies revealed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all the associated statistics. This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time. The proposed approach offers several advantages over previously proposed approaches, including reduced hyperparameters and improved robustness and adaptability. This noise-adaptive behavior contributes to AdaTerm's exceptional learning performance, as demonstrated through various optimization problems with different and/or unknown noise ratios. Furthermore, we introduce a new technique for deriving a theoretical regret bound without relying on AMSGrad, providing a valuable contribution to the field.
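To make the core idea concrete, below is a minimal, hypothetical sketch of how a Student's-t-based weight can make exponential-moving-average moment estimates robust to outlier gradients. This is an illustration of the general principle described in the abstract, not the actual AdaTerm update rule; the function name, the scalar setting, and the hyperparameter values (`nu`, `beta`) are all assumptions for demonstration.

```python
def t_robust_moment_update(m, v, grad, nu=5.0, beta=0.9, eps=1e-8):
    """One robust EMA step for the first and second moments (illustrative only).

    A gradient far from the running mean m (relative to the scale v) gets a
    small Student's-t style weight, so it barely moves the moment estimates.
    """
    # Squared deviation of the new gradient from the running mean, scaled by v
    D2 = (grad - m) ** 2 / (v + eps)
    # t-distribution-inspired weight: near 1 for inliers, small for outliers
    w = (nu + 1.0) / (nu + D2)
    # Down-weighted EMA: outliers push the effective decay rate toward 1
    beta_eff = 1.0 - (1.0 - beta) * min(w, 1.0)
    m_new = beta_eff * m + (1.0 - beta_eff) * grad
    v_new = beta_eff * v + (1.0 - beta_eff) * (grad - m) ** 2
    return m_new, v_new
```

Starting from `m = 0, v = 1`, an inlier gradient of `1.0` shifts the mean estimate far more than an outlier gradient of `100.0` does, which is the noise-resistant behavior the t-distribution model is meant to provide.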