Paper Title
Convergence to good non-optimal critical points in the training of neural networks: Gradient descent optimization with one random initialization overcomes all bad non-global local minima with high probability
Paper Authors
Paper Abstract
Gradient descent (GD) methods for the training of artificial neural networks (ANNs) belong nowadays to the most heavily employed computational schemes in the digital world. Despite the compelling success of such methods, it remains an open problem to provide a rigorous theoretical justification for the success of GD methods in the training of ANNs. The main difficulty is that the optimization risk landscapes associated to ANNs usually admit many non-optimal critical points (saddle points as well as non-global local minima) whose risk values are strictly larger than the optimal risk value. It is a key contribution of this article to overcome this obstacle in certain simplified shallow ANN training situations. In such simplified ANN training scenarios we prove that the gradient flow (GF) dynamics with only one random initialization overcomes with high probability all bad non-global local minima (all non-global local minima whose risk values are much larger than the risk value of the global minima) and converges with high probability to a good critical point (a critical point whose risk value is very close to the optimal risk value of the global minima). This analysis allows us to establish convergence in probability to zero of the risk value of the GF trajectories with convergence rates as the ANN training time and the width of the ANN increase to infinity. We complement the analytical findings of this work with extensive numerical simulations for shallow and deep ANNs: All these numerical simulations strongly suggest that with high probability the considered GD method (stochastic GD or Adam) overcomes all bad non-global local minima, does not converge to a global minimum, but does converge to a good non-optimal critical point whose risk value is very close to the optimal risk value.
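To make the experimental setting described in the abstract concrete, the following is a minimal sketch (not the authors' actual experimental setup) of the kind of numerical simulation mentioned: a shallow ReLU ANN with a single random initialization is trained with Adam, and the empirical risk is tracked over training. The target function, network width, sample size, and hyperparameters below are assumptions chosen only for illustration.

# Minimal sketch (assumed setup, not the authors' experiments): train a shallow
# ReLU ANN from one random initialization with Adam and monitor the empirical risk.
import torch

torch.manual_seed(0)                      # one random initialization

width = 64                                # hidden width of the shallow ANN (assumed)
model = torch.nn.Sequential(
    torch.nn.Linear(1, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)

# Synthetic supervised-learning problem: approximate a smooth target on [-1, 1].
x = torch.rand(1024, 1) * 2 - 1
y = torch.sin(torch.pi * x)               # assumed target function

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()              # empirical risk (mean squared error)

for step in range(5000):
    opt.zero_grad()
    risk = loss_fn(model(x), y)
    risk.backward()
    opt.step()
    if step % 1000 == 0:
        # The abstract's observation: the risk approaches a value very close to
        # the optimal risk, even if the limit is only a good non-optimal critical point.
        print(f"step {step:5d}  empirical risk {risk.item():.6f}")

In such a run one would watch whether the printed risk values decay towards a level close to the best achievable risk for the given width, which is the qualitative behaviour the abstract attributes to SGD and Adam in its simulations.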