Paper Title

Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

Paper Authors

Rui Liu, Young Jin Kim, Alexandre Muzio, Hany Hassan Awadalla

Paper Abstract

Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due to their outrageous scaling capability, which enables dramatic increases in model size without significant increases in computational cost. To achieve this, MoE models replace the feed-forward sub-layer in transformers with a Mixture-of-Experts sub-layer and use a gating network to route each token to its assigned experts. Since the common practice for efficient training of such models requires distributing experts and tokens across different machines, this routing strategy often incurs a huge cross-machine communication cost, because tokens and their assigned experts likely reside on different machines. In this paper, we propose \emph{Gating Dropout}, which allows tokens to ignore the gating network and stay on their local machines, thus reducing the cross-machine communication. Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance. We validate the effectiveness of Gating Dropout on multilingual machine translation tasks. Our results demonstrate that Gating Dropout improves a state-of-the-art MoE model with faster wall-clock convergence and better BLEU scores for a variety of model sizes and datasets.
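
To make the routing and skipping behavior concrete, below is a minimal single-machine sketch of the idea described in the abstract, written in PyTorch. The class name `MoELayerWithGatingDropout`, the `gating_dropout_p` parameter, and the notion of a fixed "local expert" are illustrative assumptions, not the authors' implementation; the paper's actual method operates on experts distributed across machines and skips the all-to-all dispatch, which is not modeled here.

```python
# Minimal sketch of Gating Dropout on a single machine (no real all-to-all).
# Names and parameters below are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class MoELayerWithGatingDropout(nn.Module):
    def __init__(self, d_model, num_experts, local_expert_id=0, gating_dropout_p=0.5):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # gating (routing) network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )
        self.local_expert_id = local_expert_id    # the expert "hosted on this machine"
        self.gating_dropout_p = gating_dropout_p  # probability of skipping the gate

    def forward(self, x):  # x: (num_tokens, d_model)
        if self.training and torch.rand(()).item() < self.gating_dropout_p:
            # Gating Dropout: ignore the gate and keep every token on the local expert.
            # In a distributed setting, this is the step that avoids the cross-machine
            # all-to-all communication.
            return self.experts[self.local_expert_id](x)
        # Otherwise, standard top-1 routing through the gating network.
        expert_ids = self.gate(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for eid, expert in enumerate(self.experts):
            mask = expert_ids == eid
            if mask.any():
                out[mask] = expert(x[mask])
        return out


# Example usage:
layer = MoELayerWithGatingDropout(d_model=16, num_experts=4)
tokens = torch.randn(8, 16)
print(layer(tokens).shape)  # torch.Size([8, 16])
```

In this sketch the skip decision is made once per layer call, so all tokens in the batch either route normally or stay local together; at inference time (`layer.eval()`) the gate is always used.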
