Paper Title
GCWSNet: Generalized Consistent Weighted Sampling for Scalable and Accurate Training of Neural Networks
Paper Authors
Paper Abstract
We develop the "generalized consistent weighted sampling" (GCWS) for hashing the "powered-GMM" (pGMM) kernel (with a tuning parameter $p$). It turns out that GCWS provides a numerically stable scheme for applying the power transformation to the original data, regardless of the magnitude of $p$ and the data. The power transformation is often effective for boosting performance, in many cases considerably so. We feed the hashed data to neural networks on a variety of public classification datasets and name our method ``GCWSNet''. Our extensive experiments show that GCWSNet often improves the classification accuracy. Furthermore, it is evident from the experiments that GCWSNet converges substantially faster. In fact, GCWSNet often reaches a reasonable accuracy within merely one epoch (or less) of the training process. This property is highly desirable because many applications, such as advertisement click-through rate (CTR) prediction models, or data streams (i.e., data seen only once), often train for just one epoch. Another beneficial side effect is that the computations of the first layer of the neural network become additions instead of multiplications because the input data become binary (and highly sparse). Empirical comparisons with (normalized) random Fourier features (NRFF) are provided. We also propose to reduce the model size of GCWSNet via count-sketch and develop the theory for analyzing the impact of count-sketch on the accuracy of GCWS. Our analysis shows that an ``8-bit'' strategy should work well, in that we can always apply an 8-bit count-sketch on the output of GCWS hashing without hurting the accuracy much. There are many other ways to take advantage of GCWS when training deep neural networks. For example, one can apply GCWS to the outputs of the last layer to boost the accuracy of trained deep neural networks.
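
To make the numerical-stability claim concrete, below is a minimal NumPy sketch of consistent weighted sampling in the style of Ioffe's CWS construction, with the pGMM power $p$ folded into the log-domain weights. The function name `gcws_hash`, the per-hash seeding, and the index-only ("0-bit"-style) output are illustrative assumptions, not the paper's actual implementation; the input is assumed non-negative.

```python
import numpy as np

def gcws_hash(w, p=1.0, num_hashes=64, seed=0):
    """Toy GCWS sketch: one sampled coordinate index per hash repetition."""
    D = w.shape[0]
    nz = np.flatnonzero(w > 0)          # only positive-weight coordinates matter
    logw = np.log(w[nz])
    out = np.empty(num_hashes, dtype=np.int64)
    for k in range(num_hashes):
        # Seeding by (seed, k) keeps the per-coordinate randomness identical
        # across all input vectors, which is what makes the sampling "consistent".
        rng = np.random.default_rng((seed, k))
        r = rng.gamma(2.0, 1.0, size=D)      # r_i ~ Gamma(2, 1)
        c = rng.gamma(2.0, 1.0, size=D)      # c_i ~ Gamma(2, 1)
        beta = rng.uniform(0.0, 1.0, size=D)
        # The power p enters only as p * log(w_i): no overflow for large p or w.
        t = np.floor(p * logw / r[nz] + beta[nz])
        log_a = np.log(c[nz]) - r[nz] * (t - beta[nz] + 1.0)  # compare a_i in log space
        out[k] = nz[np.argmin(log_a)]
    return out
```

Because $p$ only multiplies $\log w_i$ and the comparisons are done entirely in the log domain, the powered weights $w_i^p$ are never materialized, which is the stability property highlighted in the abstract.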
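The ``8-bit'' count-sketch strategy mentioned above can be sketched in the same spirit: each GCWS output is one-hot encoded and compressed into $2^8 = 256$ signed bins, so the input to the first network layer stays binary and extremely sparse. The encoder below is a toy illustration; the function name, the seeded stand-in for independent bucket/sign hash functions, and the concatenated layout are assumptions, not the paper's scheme.

```python
import numpy as np

def countsketch_encode(gcws_indices, b=8, seed=0):
    """Compress each GCWS hash output with a count-sketch into 2**b bins.

    A one-hot vector passed through a count-sketch collapses to a single
    +/-1 entry in one bucket, so the resulting feature vector stays binary
    and highly sparse (one nonzero per hash repetition).
    """
    bins = 2 ** b
    num_hashes = gcws_indices.shape[0]
    features = np.zeros(num_hashes * bins, dtype=np.int8)
    for k, h in enumerate(gcws_indices):
        # Seeded draws keyed on (seed, k, h) stand in for independent
        # bucket and sign hash functions for repetition k.
        g = np.random.default_rng((seed, k, int(h)))
        bucket = int(g.integers(bins))
        sign = 1 if int(g.integers(2)) else -1
        features[k * bins + bucket] = sign
    return features
```

Under these illustrative choices, `countsketch_encode(gcws_hash(x, p=2.0, num_hashes=1024))` would yield a 1024 x 256 binary feature vector with exactly one nonzero per hash, so the first-layer computation reduces to signed additions.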