Paper Title

Training Overparametrized Neural Networks in Sublinear Time

Paper Authors

Yichuan Deng, Hang Hu, Zhao Song, Omri Weinstein, Danyang Zhuo

Paper Abstract

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, stochastic gradient descent (SGD) has a prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with much faster convergence rates, albeit with higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and an input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of deep neural networks (DNNs).
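To make the "binary search tree" view above a bit more concrete, the following Python sketch shows one plausible way such a structure could behave; it is an illustrative assumption on our part, not the paper's actual construction. A max-augmented binary tree stores, for a single datapoint $x_i$, the per-neuron scores $\langle w_r, x_i \rangle$ at its leaves. Updating one neuron's score then touches only $O(\log m)$ tree nodes, and listing the neurons whose score exceeds an activation threshold prunes entire inactive subtrees, mirroring the idea that each iteration modifies only a small subset of nodes.

```python
# Hypothetical sketch (not the paper's exact data structure): a max-augmented
# binary tree over per-neuron scores s_r = <w_r, x_i> for one datapoint x_i.
# Each internal node caches the max of its subtree, so a single score update
# touches O(log m) nodes and reporting "active" neurons skips dead subtrees.

import math

class MaxSearchTree:
    def __init__(self, scores):
        self.m = len(scores)
        self.size = 1 << math.ceil(math.log2(max(self.m, 1)))
        # tree[size + r] is leaf r; internal node v caches max of its children
        self.tree = [float("-inf")] * (2 * self.size)
        for r, s in enumerate(scores):
            self.tree[self.size + r] = s
        for v in range(self.size - 1, 0, -1):
            self.tree[v] = max(self.tree[2 * v], self.tree[2 * v + 1])

    def update(self, r, new_score):
        """Change neuron r's score; only O(log m) nodes are recomputed."""
        v = self.size + r
        self.tree[v] = new_score
        v //= 2
        while v >= 1:
            self.tree[v] = max(self.tree[2 * v], self.tree[2 * v + 1])
            v //= 2

    def active_neurons(self, threshold):
        """Return indices r with score > threshold, pruning inactive subtrees."""
        out, stack = [], [1]
        while stack:
            v = stack.pop()
            if self.tree[v] <= threshold:
                continue  # no leaf below v can be active
            if v >= self.size:
                out.append(v - self.size)
            else:
                stack.extend((2 * v, 2 * v + 1))
        return out

# Usage: m = 8 neurons with scores <w_r, x_i>; the activation threshold
# b = 0.5 here is a made-up value for illustration only.
tree = MaxSearchTree([0.1, 0.9, -0.3, 0.6, 0.2, 0.0, 0.7, -1.0])
print(sorted(tree.active_neurons(0.5)))   # [1, 3, 6]
tree.update(1, 0.2)                       # one SGD-style weight change
print(sorted(tree.active_neurons(0.5)))   # [3, 6]
```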
