Paper Title

Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution

Paper Authors

Antonio Orvieto, Simon Lacoste-Julien, Nicolas Loizou

Paper Abstract

Recently, Loizou et al. (2021) proposed and analyzed stochastic gradient descent (SGD) with the stochastic Polyak stepsize (SPS). The proposed SPS comes with strong convergence guarantees and competitive performance; however, it has two main drawbacks when used in non-over-parameterized regimes: (i) it requires a priori knowledge of the optimal mini-batch losses, which are not available when the interpolation condition is not satisfied (e.g., regularized objectives), and (ii) it guarantees convergence only to a neighborhood of the solution. In this work, we study the dynamics and the convergence properties of SGD equipped with new variants of the stochastic Polyak stepsize and provide solutions to both drawbacks of the original SPS. We first show that a simple modification of the original SPS that uses lower bounds instead of the optimal function values can directly solve issue (i). Solving issue (ii), on the other hand, turns out to be more challenging and leads us to valuable insights into the method's behavior. We show that if interpolation is not satisfied, the correlation between SPS and the stochastic gradients introduces a bias, which effectively distorts the expectation of the gradient signal near minimizers and leads to non-convergence, even if the stepsize is scaled down during training. To fix this issue, we propose DecSPS, a novel modification of SPS that guarantees convergence to the exact minimizer without a priori knowledge of the problem parameters. For strongly convex optimization problems, DecSPS is the first stochastic adaptive optimization method that converges to the exact solution without restrictive assumptions such as bounded iterates/gradients.
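As a rough illustration of the stepsize rules described in the abstract, below is a minimal NumPy sketch of an SGD loop with a Polyak-style stepsize that uses a lower bound on the optimal mini-batch loss and a DecSPS-style decreasing cap. The specific constants (c0, the sqrt(k+1) schedule, the initial cap) and the toy least-squares problem are assumptions for illustration, not the exact algorithm or analysis from the paper.

```python
import numpy as np

def decsps_stepsize(loss, grad, lb, k, prev_gamma, c0=1.0, eps=1e-12):
    # Sketch of a DecSPS-style stepsize (constants and schedule are assumptions):
    #   gamma_k = (1/c_k) * min( (f_i(x_k) - l_i) / ||g_i||^2, c_{k-1} * gamma_{k-1} )
    # with c_k = c0 * sqrt(k + 1), and l_i a lower bound on the optimal
    # mini-batch loss (l_i = 0 is valid for non-negative losses; using the
    # exact optimal mini-batch loss recovers the original SPS numerator).
    c_k = c0 * np.sqrt(k + 1)
    c_prev = c0 * np.sqrt(k) if k > 0 else c0
    polyak = (loss - lb) / (np.dot(grad, grad) + eps)  # Polyak-style ratio
    return min(polyak, c_prev * prev_gamma) / c_k

# Toy usage: mini-batch least squares, f_i(x) = 0.5 * mean((A_i x - b_i)^2).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(200, 5)), rng.normal(size=200)
x, gamma = np.zeros(5), 1.0  # gamma plays the role of gamma_{k-1}; init is a guess
for k in range(500):
    idx = rng.choice(200, size=10, replace=False)
    res = A[idx] @ x - b[idx]
    loss = 0.5 * np.mean(res ** 2)
    grad = A[idx].T @ res / len(idx)
    gamma = decsps_stepsize(loss, grad, lb=0.0, k=k, prev_gamma=gamma)
    x -= gamma * grad
```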
