Paper Title

Escaping From Saddle Points Using Asynchronous Coordinate Gradient Descent

Paper Authors

Marco Bornstein, Jin-Peng Liu, Jingling Li, Furong Huang

Paper Abstract

Large-scale non-convex optimization problems are expensive to solve due to computational and memory costs. To reduce the costs, first-order (computationally efficient) and asynchronous-parallel (memory efficient) algorithms are necessary to minimize non-convex functions in machine learning. However, asynchronous first-order methods applied within non-convex settings run into two difficulties: (i) parallelization delays, which affect convergence by disrupting the monotonicity of first-order methods, and (ii) sub-optimal saddle points where the gradient is zero. To solve these two difficulties, we propose an asynchronous coordinate gradient descent algorithm shown to converge to local minima with a bounded delay. Our algorithm overcomes parallelization-delay issues by using a carefully constructed Hamiltonian function. We prove that our designed kinetic-energy term, incorporated within the Hamiltonian, allows our algorithm to decrease monotonically per iteration. Next, our algorithm steers iterates clear of saddle points by utilizing a perturbation sub-routine. Similar to other state-of-the-art (SOTA) algorithms, we achieve a poly-logarithmic convergence rate with respect to dimension. Unlike other SOTA algorithms, which are synchronous, our work is the first to study how parallelization delays affect the convergence rate of asynchronous first-order algorithms. We prove that our algorithm outperforms synchronous counterparts under large parallelization delays, with convergence depending sublinearly on the delays. To our knowledge, this is the first local-optima convergence result for a first-order asynchronous algorithm in non-convex settings.
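The abstract describes two mechanisms: a Hamiltonian (the objective plus a designed kinetic-energy term) that keeps decreasing even when gradients arrive with a bounded delay, and a perturbation sub-routine that pushes iterates away from saddle points. The sketch below is a minimal, single-threaded illustration of those two ideas and is not the authors' algorithm: the delay model, the quadratic kinetic-energy term H(x, v) = f(x) + ½‖v‖², the hyperparameters, and the function name `perturbed_delayed_coordinate_descent` are all assumptions made for illustration.

```python
# Minimal sketch of delayed coordinate descent with a Hamiltonian monitor and
# a saddle-point perturbation step. Illustrative only; not the paper's method.
import numpy as np

def perturbed_delayed_coordinate_descent(f, grad, x0, lr=0.1, beta=0.9,
                                         delay=2, perturb_radius=1e-2,
                                         grad_tol=1e-3, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                  # momentum / "kinetic energy" variable
    history = [x.copy()]                  # past iterates, used to model staleness
    H = f(x)
    for t in range(iters):
        # Bounded parallelization delay: the gradient is evaluated at an
        # iterate that is up to `delay` steps old.
        stale_x = history[max(0, len(history) - 1 - delay)]
        i = t % x.size                    # coordinate updated at this iteration
        g_i = grad(stale_x)[i]            # stale partial derivative

        v[i] = beta * v[i] - lr * g_i     # momentum update for this coordinate
        x[i] += v[i]                      # coordinate gradient-descent step
        history.append(x.copy())

        # Perturbation sub-routine: near a zero-gradient point (a possible
        # saddle), inject a small uniform perturbation to escape it.
        if np.linalg.norm(grad(x)) < grad_tol:
            x += rng.uniform(-perturb_radius, perturb_radius, size=x.shape)

        # Hamiltonian-style monitor: objective plus kinetic energy. The paper's
        # claim is that a well-designed H decreases monotonically per iteration.
        H = f(x) + 0.5 * float(np.dot(v, v))
    return x, H

# Toy objective with a saddle at the origin: f(x, y) = x^2 + y^4 - y^2.
f = lambda z: z[0] ** 2 + z[1] ** 4 - z[1] ** 2
grad = lambda z: np.array([2 * z[0], 4 * z[1] ** 3 - 2 * z[1]])
x_final, H_final = perturbed_delayed_coordinate_descent(f, grad, x0=[0.5, 0.0])
print(x_final, H_final)  # expected to move toward (0, ±1/sqrt(2)) rather than stall
```

The toy objective f(x, y) = x² + y⁴ − y² is flat in y at the origin, so plain coordinate descent started at y = 0 would stall at that saddle; the perturbation step is what lets the iterate slide toward the minima at y = ±1/√2.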
