Paper Title

Feature Whitening via Gradient Transformation for Improved Convergence

Paper Authors

Shmulik Markovich-Golan, Barak Battash, Amit Bleiweiss

Paper Abstract

Feature whitening is a known technique for speeding up the training of DNNs. Under certain assumptions, whitening the activations reduces the Fisher information matrix to a simple identity matrix, in which case stochastic gradient descent is equivalent to the faster natural gradient descent. Due to the additional complexity resulting from transforming the layer inputs and their corresponding gradients in the forward and backward propagation, and from repeatedly computing the eigenvalue decomposition (EVD), this method is not commonly used to date. In this work, we address the complexity drawbacks of feature whitening. Our contribution is twofold. First, we derive an equivalent method, which replaces the sample transformations with a transformation of the weight gradients, applied to every batch of B samples. The complexity is reduced by a factor of S/(2B), where S denotes the feature dimension of the layer output. As the batch size increases with distributed training, the benefit of using the proposed method becomes more compelling. Second, motivated by the theoretical relation between the condition number of the sample covariance matrix and the convergence speed, we derive an alternative sub-optimal algorithm which recursively reduces the condition number of the latter matrix. Compared to EVD, complexity is reduced by a factor of the input feature dimension M. We exemplify the proposed algorithms with ResNet-based networks for image classification demonstrated on the CIFAR and ImageNet datasets. Parallelizing the proposed algorithms is straightforward and we implement a distributed version thereof. Improved convergence, in terms of speed and attained accuracy, can be observed in our experiments.
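The abstract's starting point, EVD-based feature whitening, can be illustrated with a short sketch. This is a generic ZCA-style whitening of a batch of activations, not the paper's proposed gradient-transformation algorithm; the dimensions B and M and the random data are assumptions for illustration. It shows the property the abstract relies on: after whitening, the sample covariance becomes (approximately) the identity, so its condition number collapses to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Batch of B samples with correlated M-dimensional features (illustrative sizes)
B, M = 256, 8
A = rng.normal(size=(M, M))
X = rng.normal(size=(B, M)) @ A  # correlated activations

# Sample covariance of the centered batch and its eigenvalue decomposition (EVD)
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / B
eigvals, eigvecs = np.linalg.eigh(cov)

# ZCA whitening transform: cov^{-1/2}
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
Xw = Xc @ W

# Whitened covariance is (approximately) the identity matrix,
# so its condition number drops from cond(cov) to ~1
cov_w = Xw.T @ Xw / B
print(np.allclose(cov_w, np.eye(M), atol=1e-6))   # True
print(np.linalg.cond(cov) > np.linalg.cond(cov_w))  # True
```

The repeated EVD here (once per batch, O(M^3)) is exactly the cost the paper targets: its first algorithm moves the transformation from the samples to the weight gradients, and its second replaces the exact EVD with a cheaper recursive reduction of the condition number.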
