Paper Title
AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization
Paper Authors
Abstract
With the increasing practicality of deep learning applications, practitioners are inevitably faced with datasets corrupted by noise from various sources such as measurement errors, mislabeling, and estimated surrogate inputs/outputs that can adversely impact the optimization results. It is a common practice to improve the optimization algorithm's robustness to noise, since this algorithm is ultimately in charge of updating the network parameters. Previous studies revealed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all the associated statistics. This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time. The proposed approach offers several advantages over previously proposed approaches, including reduced hyperparameters and improved robustness and adaptability. This noise-adaptive behavior contributes to AdaTerm's exceptional learning performance, as demonstrated through various optimization problems with different and/or unknown noise ratios. Furthermore, we introduce a new technique for deriving a theoretical regret bound without relying on AMSGrad, providing a valuable contribution to the field.
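To make the core idea concrete, below is a minimal, hypothetical sketch of how a Student's-t-based weight can make exponential-moving-average moment estimates robust to outlier gradients. This is an illustration of the general principle described in the abstract, not the actual AdaTerm update rule; the function name, the scalar setting, and the hyperparameter values (`nu`, `beta`) are all assumptions for demonstration.

```python
def t_robust_moment_update(m, v, grad, nu=5.0, beta=0.9, eps=1e-8):
    """One robust EMA step for the first and second moments (illustrative only).

    A gradient far from the running mean m (relative to the scale v) gets a
    small Student's-t style weight, so it barely moves the moment estimates.
    """
    # Squared deviation of the new gradient from the running mean, scaled by v
    D2 = (grad - m) ** 2 / (v + eps)
    # t-distribution-inspired weight: near 1 for inliers, small for outliers
    w = (nu + 1.0) / (nu + D2)
    # Down-weighted EMA: outliers push the effective decay rate toward 1
    beta_eff = 1.0 - (1.0 - beta) * min(w, 1.0)
    m_new = beta_eff * m + (1.0 - beta_eff) * grad
    v_new = beta_eff * v + (1.0 - beta_eff) * (grad - m) ** 2
    return m_new, v_new
```

Starting from `m = 0, v = 1`, an inlier gradient of `1.0` shifts the mean estimate far more than an outlier gradient of `100.0` does, which is the noise-resistant behavior the t-distribution model is meant to provide.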