Paper Title
On the Stability Properties and the Optimization Landscape of Training Problems with Squared Loss for Neural Networks and General Nonlinear Conic Approximation Schemes
Paper Authors
Paper Abstract
We study the optimization landscape and the stability properties of training problems with squared loss for neural networks and general nonlinear conic approximation schemes. It is demonstrated that, if a nonlinear conic approximation scheme is considered that is (in an appropriately defined sense) more expressive than a classical linear approximation approach and if there exist unrealizable label vectors, then a training problem with squared loss is necessarily unstable in the sense that its solution set depends discontinuously on the label vector in the training data. We further prove that the same effects that are responsible for these instability properties are also the reason for the emergence of saddle points and spurious local minima, which may be arbitrarily far away from global solutions, and that neither the instability of the training problem nor the existence of spurious local minima can, in general, be overcome by adding a regularization term to the objective function that penalizes the size of the parameters in the approximation scheme. The latter results are shown to be true regardless of whether the assumption of realizability is satisfied or not. We demonstrate that our analysis in particular applies to training problems for free-knot interpolation schemes and deep and shallow neural networks with variable widths that involve an arbitrary mixture of various activation functions (e.g., binary, sigmoid, tanh, arctan, soft-sign, ISRU, soft-clip, SQNL, ReLU, leaky ReLU, soft-plus, bent identity, SILU, ISRLU, and ELU). In summary, the findings of this paper illustrate that the improved approximation properties of neural networks and general nonlinear conic approximation instruments are linked in a direct and quantifiable way to undesirable properties of the optimization problems that have to be solved in order to train them.
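To make the instability phenomenon described in the abstract concrete, the following minimal numerical sketch (our own illustration, not taken from the paper; the one-parameter sigmoid model, the inputs, and the label vectors are assumptions chosen purely for exposition) fits ψ_w(x) = sigmoid(w·x) to the inputs (-1, 0, 1). For every ε > 0 the label vector (ε, 1/2, 1−ε) is exactly realizable, with unique minimizer w* = log((1−ε)/ε), but the limiting label vector (0, 1/2, 1) is unrealizable: the infimum of the squared loss is zero and is never attained.

```python
# Minimal illustrative sketch (not from the paper): a one-parameter model
#   psi_w(x) = sigmoid(w * x)
# on inputs x = (-1, 0, 1).  The labels y_eps = (eps, 1/2, 1 - eps) are
# exactly realizable for every eps > 0, but the optimal parameter w*
# diverges as eps -> 0, and the limiting label (0, 1/2, 1) is unrealizable.

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([-1.0, 0.0, 1.0])

def squared_loss(w, y):
    pred = 1.0 / (1.0 + np.exp(-w * x))   # sigmoid(w * x)
    return np.sum((pred - y) ** 2)

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    y = np.array([eps, 0.5, 1.0 - eps])
    res = minimize_scalar(squared_loss, args=(y,), bounds=(0.0, 50.0), method="bounded")
    # closed form: sigmoid(-w) = eps  <=>  w = log((1 - eps) / eps)
    w_exact = np.log((1.0 - eps) / eps)
    print(f"eps={eps:.0e}  numeric w*={res.x:8.3f}  exact w*={w_exact:8.3f}  loss={res.fun:.2e}")

# As eps -> 0, the unique minimizer w* = log((1 - eps)/eps) grows without
# bound, so the solution map y -> argmin is discontinuous (indeed unbounded)
# near the unrealizable label vector (0, 1/2, 1), where no minimizer exists.
```

Running the sketch prints minimizers that grow without bound as ε shrinks; this unbounded, discontinuous dependence of the solution set on the label vector near an unrealizable label is precisely the kind of instability the abstract refers to, here reproduced in the simplest possible setting.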