Paper Title
Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs
Paper Authors
Paper Abstract
A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations that this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5) accuracy in this configuration, far higher than when training an equivalent number of randomly chosen parameters elsewhere in the network. BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features. Not only do these results highlight the expressive power of affine parameters in deep learning, but, in a broader sense, they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.
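As a concrete illustration of the training configuration described in the abstract, below is a minimal PyTorch sketch (not taken from the paper): every weight is frozen at its random initialization, and only the per-channel BatchNorm affine parameters (the scale gamma and shift beta) remain trainable. The choice of torchvision's resnet18 and the optimizer settings are illustrative assumptions; the paper's experiments use much deeper ResNets on CIFAR-10 and ImageNet.

```python
import torch
import torch.nn as nn
import torchvision

# Build a ResNet with randomly initialized (non-pretrained) weights.
# resnet18 is an illustrative stand-in for the paper's deeper ResNets.
model = torchvision.models.resnet18(weights=None)

# Freeze every parameter at its random initialization...
for param in model.parameters():
    param.requires_grad = False

# ...then re-enable gradients only for the BatchNorm affine parameters:
# the per-channel scale (gamma, stored as .weight) and shift (beta, stored as .bias).
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.weight.requires_grad = True
        module.bias.requires_grad = True

# Optimize only the trainable (BatchNorm affine) parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)

# Sanity check: only the BatchNorm affine parameters receive gradients.
x = torch.randn(8, 3, 32, 32)  # a CIFAR-10-sized dummy batch
loss = model(x).sum()
loss.backward()
```

In this setup the random convolutional filters act as fixed feature extractors, and training can only shift and rescale their outputs channel by channel, which is exactly the restricted expressive power the paper measures.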