Paper Title

Anomalous diffusion dynamics of learning in deep neural networks

Paper Authors

Guozhang Chen, Cheng Kevin Qu, Pulin Gong

Paper Abstract

Learning in deep neural networks (DNNs) is implemented through minimizing a highly non-convex loss function, typically by a stochastic gradient descent (SGD) method. This learning process can effectively find good wide minima without being trapped in poor local ones. We present a novel account of how such effective deep learning emerges through the interaction of SGD and the geometrical structure of the loss landscape. We find that, rather than being a normal diffusion process (i.e., Brownian motion) as often assumed, SGD exhibits rich, complex dynamics when navigating through the loss landscape: initially, it exhibits anomalous superdiffusion, which attenuates gradually and changes to subdiffusion at long times when the solution is reached. Such learning dynamics occur ubiquitously in different DNNs, such as ResNet and VGG-like networks, and are insensitive to batch size and learning rate. The anomalous superdiffusion during the initial learning phase indicates that the motion of SGD along the loss landscape possesses intermittent, big jumps; this non-equilibrium property enables SGD to escape from sharp local minima. By adapting methods developed for studying energy landscapes in complex physical systems, we find that such superdiffusive learning dynamics arise from the interaction of SGD with the fractal-like structure of the loss landscape. We further develop a simple model to demonstrate the mechanistic role of the fractal loss landscape in enabling SGD to effectively find global minima. Our results thus reveal the effectiveness of deep learning from a novel perspective and have implications for designing efficient deep neural networks.
