Paper Title

Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

Paper Authors

Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Hoi, Weinan E

Paper Abstract

It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms through their Levy-driven stochastic differential equations (SDEs) because of the similar convergence behaviors of an algorithm and its SDE. Then we establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in gradient noise and results in a larger Radon measure of a basin; (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than those of SGD. So SGD is more locally unstable than ADAM at sharp minima, defined as the minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. As flat minima here, which often refer to the minima at flat or asymmetric basins/valleys, often generalize better than sharp ones, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and theoretical results.
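For context, the Levy-driven SDE analysis mentioned in the abstract typically replaces the Gaussian noise of a Langevin-type model with an alpha-stable Levy process so that heavy-tailed gradient noise is captured. The display below is a minimal, illustrative sketch in generic notation; the loss F, learning rate \eta, noise scale \varepsilon, tail index \alpha, and driving process L_t^{\alpha} are assumed symbols for exposition, not necessarily the paper's exact formulation.

```latex
% Discrete SGD iterate with heavy-tailed gradient noise \xi_k (illustrative)
\theta_{k+1} = \theta_k - \eta\,\nabla F(\theta_k) + \eta\,\xi_k

% Continuous-time Levy-driven SDE approximating the dynamics, where
% L_t^{\alpha} is an \alpha-stable Levy process with tail index \alpha < 2
% (smaller \alpha corresponds to heavier gradient-noise tails)
d\theta_t = -\nabla F(\theta_t)\,dt + \varepsilon\, dL_t^{\alpha}
```

Under this kind of model, the expected escaping time from a local basin grows with the basin's Radon measure and shrinks as the noise tails get heavier (smaller \alpha), which is the mechanism the abstract invokes to contrast SGD and ADAM.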
