Paper Title
Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs
Paper Authors
Paper Abstract
Adaptive gradient methods have attracted much attention from the machine learning community due to their high efficiency. However, their acceleration effect in practice, especially in neural network training, is hard to analyze theoretically. The huge gap between theoretical convergence results and practical performance prevents further understanding of existing optimizers and the development of more advanced optimization methods. In this paper, we provide a novel analysis of adaptive gradient methods under an additional mild assumption, and revise AdaGrad to \radagrad to match a better provable convergence rate. To find an $\epsilon$-approximate first-order stationary point of non-convex objectives, we prove that random-shuffling \radagrad achieves an $\tilde{O}(T^{-1/2})$ convergence rate, which improves upon existing adaptive gradient methods and random-shuffling SGD by factors of $\tilde{O}(T^{-1/4})$ and $\tilde{O}(T^{-1/6})$, respectively. To the best of our knowledge, this is the first time adaptive gradient methods have been shown to be deterministically faster than SGD after finite epochs. Furthermore, we conduct comprehensive experiments to validate the additional mild assumption and the acceleration effect brought by second moments and random shuffling.
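Since the abstract gives no pseudocode, the sketch below is only a minimal illustration of the algorithmic template it refers to: AdaGrad-style coordinate-wise adaptivity (diagonal second moments) combined with epoch-wise random shuffling, i.e., sampling without replacement. The function name `shuffled_adagrad`, the argument `grad_fn`, and all hyperparameter values are illustrative assumptions; the exact \radagrad revision analyzed in the paper may differ.

```python
import numpy as np

def shuffled_adagrad(grad_fn, x0, n_samples, epochs=10, lr=0.1, eps=1e-8):
    """Sketch of epoch-wise random-shuffling AdaGrad (illustrative, not \radagrad itself).

    grad_fn(x, i) is assumed to return the gradient of the i-th sample's loss at x.
    """
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)                      # running sum of squared gradients (second moments)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):  # visit every sample once per epoch (random shuffling)
            g = grad_fn(x, i)
            G += g * g                        # accumulate diagonal second moments
            x -= lr * g / (np.sqrt(G) + eps)  # coordinate-wise adaptive step
    return x
```

In this sketch each epoch touches every sample exactly once, which is the without-replacement (random shuffling) scheme the abstract contrasts with standard i.i.d. stochastic sampling, while the accumulator G supplies the second-moment adaptivity.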