Paper Title
Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
Paper Authors
Paper Abstract
Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight-decay) encourages GD to reduce the sharpness of the loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
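The scale-invariance mentioned in the abstract is why "sharpness" has to be defined carefully: rescaling the weights that feed a normalization layer leaves the network's output (and loss) unchanged, so the raw Hessian-based sharpness can be shifted at will without changing the function. Below is a minimal numpy sketch of this property, not taken from the paper; the `layer_norm` and `forward` functions and all shapes are illustrative assumptions.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learned affine parameters).
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def forward(W, x):
    # A single linear layer whose pre-activations are normalized,
    # standing in for the normalized layers of a deeper network.
    return layer_norm(x @ W.T)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # a small batch of inputs (illustrative shapes)
W = rng.normal(size=(16, 8))    # weights feeding the normalization layer

out = forward(W, x)
out_scaled = forward(3.0 * W, x)   # rescale the weights by an arbitrary factor

# The outputs (and hence any loss computed from them) agree up to the eps term,
# so ordinary Hessian-based sharpness could be made arbitrarily small or large
# by rescaling W without changing the function the network computes.
print(np.allclose(out, out_scaled, atol=1e-4))  # True
```

This is also the setting in which the Edge of Stability regime is usually described: with learning rate η, GD on a quadratic model is stable only while the top Hessian eigenvalue stays below 2/η, and in the EoS regime that eigenvalue hovers around this threshold while the properly defined (scale-invariant) sharpness is gradually driven down.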