Paper Title
Convergence of Online Adaptive and Recurrent Optimization Algorithms
Paper Authors
Paper Abstract
We prove local convergence of several notable gradient descent algorithms used in machine learning, for which standard stochastic gradient descent theory does not apply directly. This includes, first, online algorithms for recurrent models and dynamical systems, such as \emph{Real-time recurrent learning} (RTRL) and its computationally lighter approximations NoBackTrack and UORO; second, several adaptive algorithms such as RMSProp, online natural gradient, and Adam with $\beta_2 \to 1$.

Despite local convergence being a relatively weak requirement for a new optimization algorithm, no local analysis was available for these algorithms, as far as we knew. Analysis of these algorithms does not immediately follow from standard stochastic gradient (SGD) theory. In fact, Adam has been proved to lack local convergence in some simple situations \citep{j.2018on}. For recurrent models, online algorithms modify the parameter while the model is running, which further complicates the analysis with respect to simple SGD.

Local convergence for these various algorithms results from a single, more general set of assumptions, in the setup of learning dynamical systems online. Thus, these results can cover other variants of the algorithms considered.

We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions. This is more data-agnostic and creates differences with respect to standard SGD theory, especially for the range of possible learning rates. For instance, with cycling or per-epoch reshuffling over a finite dataset instead of pure i.i.d.\ sampling with replacement, empirical averages of gradients converge at rate $1/T$ instead of $1/\sqrt{T}$ (cycling acts as a variance reduction method), theoretically allowing for larger learning rates than in SGD.
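The last claim can be checked with a quick numerical experiment. The sketch below is not from the paper; the Gaussian stand-in "gradients", the random seed, and the checkpoint times are arbitrary illustration choices. It tracks the running empirical average of values drawn from a finite dataset: under per-epoch reshuffling the deviation from the dataset mean decays roughly like $1/T$ (and vanishes exactly at epoch boundaries), whereas i.i.d. sampling with replacement only achieves the usual $1/\sqrt{T}$ rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-example gradients of a finite dataset (scalars for simplicity).
data = rng.normal(size=1000)
true_mean = data.mean()
n, epochs = data.size, 50
T = n * epochs

# Two ways of streaming the same finite dataset:
# (a) i.i.d. sampling with replacement, (b) per-epoch reshuffling ("cycling").
iid_stream = rng.choice(data, size=T, replace=True)
reshuffled_stream = np.concatenate([rng.permutation(data) for _ in range(epochs)])

# Mid-epoch checkpoints, so the reshuffled error is not trivially zero.
checkpoints = [n // 2, 5 * n + n // 2, 45 * n + n // 2]

for name, stream in [("i.i.d. w/ replacement", iid_stream),
                     ("per-epoch reshuffling", reshuffled_stream)]:
    # Running empirical average after t samples, for t = 1, ..., T.
    running_avg = np.cumsum(stream) / np.arange(1, T + 1)
    errs = [abs(running_avg[t - 1] - true_mean) for t in checkpoints]
    print(name, " ".join(f"t={t}: {e:.1e}" for t, e in zip(checkpoints, errs)))
```

Running this, the reshuffled stream's error at the later checkpoints should be roughly an order of magnitude smaller than the i.i.d. one, with the gap widening as $T$ grows, consistent with the $1/T$ versus $1/\sqrt{T}$ rates mentioned in the abstract.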