Paper Title
Stagewise Enlargement of Batch Size for SGD-based Learning
Paper Authors
Paper Abstract
Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent~(SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily lead to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameters. Then, based on this theory, we propose a novel method, called \underline{s}tagewise \underline{e}nlargement of \underline{b}atch \underline{s}ize~(\mbox{SEBS}), to set a proper batch size for SGD. More specifically, \mbox{SEBS} adopts a multi-stage scheme and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD, which decreases the learning rate by stage, \mbox{SEBS} can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for \mbox{SGD}, momentum \mbox{SGD} and AdaGrad. Empirical results on real data successfully verify the theories of \mbox{SEBS}. Furthermore, empirical results also show that SEBS can outperform other baselines.
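To make the multi-stage scheme described in the abstract concrete, below is a minimal Python sketch of stagewise batch-size enlargement for plain SGD: the learning rate stays fixed across stages while the batch size grows geometrically at each stage boundary. The growth factor, number of stages, per-stage update budget, and the noisy-quadratic test problem are illustrative assumptions, not the paper's prescribed settings.

```python
import numpy as np

def sgd_with_stagewise_batch_enlargement(
    grad_fn,                 # grad_fn(w, batch_size) -> stochastic gradient estimate
    w0,                      # initial model parameters (np.ndarray)
    lr=0.1,                  # learning rate, kept constant across stages
    b0=32,                   # initial batch size
    rho=2,                   # geometric growth factor for the batch size (assumption)
    num_stages=4,            # number of stages (assumption for illustration)
    updates_per_stage=100,   # parameter updates per stage (assumption)
):
    w = w0.copy()
    batch_size = b0
    for stage in range(num_stages):
        for _ in range(updates_per_stage):
            g = grad_fn(w, batch_size)   # mini-batch gradient with the current batch size
            w -= lr * g                  # plain SGD step; momentum/AdaGrad variants are analogous
        batch_size *= rho                # enlarge the batch size geometrically at the stage boundary
    return w

# Toy usage: noisy quadratic f(w) = 0.5 * ||w||^2, whose mini-batch gradient
# noise shrinks roughly as 1/sqrt(batch_size).
rng = np.random.default_rng(0)
noisy_grad = lambda w, b: w + rng.normal(scale=1.0 / np.sqrt(b), size=w.shape)
w_final = sgd_with_stagewise_batch_enlargement(noisy_grad, w0=np.ones(10))
```

The sketch only illustrates the contrast drawn in the abstract: classical stagewise SGD would shrink `lr` at each stage boundary, whereas here the variance of the stochastic gradient is reduced instead by enlarging `batch_size`, so fewer parameter updates are needed per stage.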