Paper Title

Training Overparametrized Neural Networks in Sublinear Time

Paper Authors

Yichuan Deng, Hang Hu, Zhao Song, Omri Weinstein, Danyang Zhuo

Paper Abstract

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, stochastic gradient descent (SGD) has a prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with much faster convergence rates, albeit with higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and an input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of deep neural networks (DNNs).
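To make the "binary search tree" view above a bit more concrete, the following Python sketch shows one plausible way such a structure could behave; it is an illustrative assumption on our part, not the paper's actual construction. A max-augmented binary tree stores, for a single datapoint $x_i$, the per-neuron scores $\langle w_r, x_i \rangle$ at its leaves. Updating one neuron's score then touches only $O(\log m)$ tree nodes, and listing the neurons whose score exceeds an activation threshold prunes entire inactive subtrees, mirroring the idea that each iteration modifies only a small subset of nodes.

```python
# Hypothetical sketch (not the paper's exact data structure): a max-augmented
# binary tree over per-neuron scores s_r = <w_r, x_i> for one datapoint x_i.
# Each internal node caches the max of its subtree, so a single score update
# touches O(log m) nodes and reporting "active" neurons skips dead subtrees.

import math

class MaxSearchTree:
    def __init__(self, scores):
        self.m = len(scores)
        self.size = 1 << math.ceil(math.log2(max(self.m, 1)))
        # tree[size + r] is leaf r; internal node v caches max of its children
        self.tree = [float("-inf")] * (2 * self.size)
        for r, s in enumerate(scores):
            self.tree[self.size + r] = s
        for v in range(self.size - 1, 0, -1):
            self.tree[v] = max(self.tree[2 * v], self.tree[2 * v + 1])

    def update(self, r, new_score):
        """Change neuron r's score; only O(log m) nodes are recomputed."""
        v = self.size + r
        self.tree[v] = new_score
        v //= 2
        while v >= 1:
            self.tree[v] = max(self.tree[2 * v], self.tree[2 * v + 1])
            v //= 2

    def active_neurons(self, threshold):
        """Return indices r with score > threshold, pruning inactive subtrees."""
        out, stack = [], [1]
        while stack:
            v = stack.pop()
            if self.tree[v] <= threshold:
                continue  # no leaf below v can be active
            if v >= self.size:
                out.append(v - self.size)
            else:
                stack.extend((2 * v, 2 * v + 1))
        return out

# Usage: m = 8 neurons with scores <w_r, x_i>; the activation threshold
# b = 0.5 here is a made-up value for illustration only.
tree = MaxSearchTree([0.1, 0.9, -0.3, 0.6, 0.2, 0.0, 0.7, -1.0])
print(sorted(tree.active_neurons(0.5)))   # [1, 3, 6]
tree.update(1, 0.2)                       # one SGD-style weight change
print(sorted(tree.active_neurons(0.5)))   # [3, 6]
```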
