Paper Title

Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD

Paper Authors

Jianyu Wang, Hao Liang, Gauri Joshi

Paper Abstract

Distributed stochastic gradient descent (SGD) is essential for scaling machine learning algorithms to a large number of computing nodes. However, infrastructure variability, such as high communication delay or random node slowdowns, greatly impedes the performance of distributed SGD algorithms, especially in wireless systems or sensor networks. In this paper, we propose an algorithmic approach named Overlap-Local-SGD (and its momentum variant) to overlap communication and computation so as to speed up the distributed training procedure. The approach also helps mitigate straggler effects. We achieve this by adding an anchor model on each node. After multiple local updates, locally trained models are pulled back towards the synchronized anchor model rather than communicating with other nodes. Experimental results from training a deep neural network on the CIFAR-10 dataset demonstrate the effectiveness of Overlap-Local-SGD. We also provide a convergence guarantee for the proposed algorithm under non-convex objective functions.
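The abstract only sketches the mechanism, so below is a minimal single-process Python simulation of the anchor-model idea: each worker runs several local SGD steps and is then pulled back towards a periodically synchronized anchor. Everything concrete here is an illustrative assumption rather than the authors' implementation: the toy quadratic objectives, the hyper-parameters `tau`, `alpha`, and `lr`, and the way the overlapped all-reduce is emulated by letting the pull-back use a one-round-stale anchor.

```python
# Minimal sketch of local updates plus pull-back towards a synchronized anchor,
# simulated sequentially in one process. Hyper-parameters and objectives are
# assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

n_workers, dim = 4, 10
tau = 5      # local SGD steps between synchronizations (assumed)
alpha = 0.5  # pull-back strength towards the anchor (assumed)
lr = 0.05    # learning rate (assumed)

# Toy heterogeneous data: worker k minimizes f_k(x) = 0.5 * ||x - c_k||^2.
centers = rng.normal(size=(n_workers, dim))

def stochastic_grad(k, x):
    """Noisy gradient of worker k's local quadratic objective."""
    return (x - centers[k]) + 0.1 * rng.normal(size=dim)

x = np.tile(rng.normal(size=dim), (n_workers, 1))  # local models (same init)
z = x.mean(axis=0)                                 # anchor model

for round_ in range(20):
    # In the real algorithm, the all-reduce that refreshes the anchor runs in
    # the background and overlaps with the local steps below. Here we mimic
    # that: a "new" anchor is launched from the current models, while this
    # round's pull-back still uses the anchor from the previous round.
    z_next = x.mean(axis=0)

    for k in range(n_workers):
        for _ in range(tau):                        # tau local SGD updates
            x[k] -= lr * stochastic_grad(k, x[k])
        # Pull the local model back towards the (slightly stale) anchor
        # instead of blocking on fresh communication with other workers.
        x[k] += alpha * (z - x[k])

    z = z_next  # the overlapped synchronization "completes" here
    avg_loss = np.mean([0.5 * np.sum((x[k] - centers[k]) ** 2)
                        for k in range(n_workers)])
    print(f"round {round_:2d}  avg local loss = {avg_loss:.4f}")
```

In an actual multi-node setup, the anchor refresh would be issued as a non-blocking collective (for example, an all-reduce with `async_op=True` in `torch.distributed`) so that it completes while the next `tau` local steps are being computed; the sketch above only imitates that timing.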
