Paper Title
Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters
Paper Authors
Paper Abstract
The ever-growing model size and scale of compute have attracted increasing interest in training deep learning models over multiple nodes. However, training on cloud clusters, especially across remote clusters, poses substantial challenges. In this work, we introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters connected by low-bandwidth wide area networks (WANs). We take natural language processing (NLP) as an example to show how Nebula-I works in different training phases, which include: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models. Together, these two phases cover the pre-train-then-fine-tune paradigm that dominates recent deep learning. To balance accuracy and communication efficiency, Nebula-I jointly applies parameter-efficient training strategies, hybrid parallel computing methods, and adaptive communication acceleration techniques. Meanwhile, security strategies are employed to guarantee safety, reliability, and privacy in intra-cluster computation and inter-cluster communication. Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware, e.g., GPUs and NPUs. Experiments demonstrate that the proposed framework substantially improves training efficiency while preserving satisfactory NLP performance. By using Nebula-I, users can run large-scale training tasks over cloud clusters with minimal development effort, and existing large pre-trained models can be put to broader use. We also report new state-of-the-art results on cross-lingual natural language inference tasks, obtained with a novel learning framework built on Nebula-I.
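To make the communication-efficiency theme of the abstract concrete, the sketch below illustrates one generic technique of the kind the paper groups under "adaptive communication acceleration": top-k gradient sparsification with local error feedback, which shrinks the payload exchanged between clusters over a low-bandwidth WAN. This is a minimal, self-contained Python/NumPy illustration under our own assumptions; the function names (`sparsify_topk`, `desparsify`) are hypothetical and it is not Nebula-I's actual implementation or PaddlePaddle API.

```python
# Minimal sketch (assumed, not Nebula-I's code): top-k gradient sparsification,
# one common way to reduce inter-cluster traffic over a low-bandwidth WAN.
# Only the largest-magnitude fraction of gradient entries is transmitted; the
# remainder is kept locally as residual error and added back next round.
import numpy as np

def sparsify_topk(grad: np.ndarray, ratio: float = 0.01):
    """Keep the top-`ratio` fraction of entries by magnitude.
    Returns (indices, values): the compressed payload to send over the WAN."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of k largest-magnitude entries
    return idx, flat[idx]

def desparsify(idx: np.ndarray, vals: np.ndarray, shape) -> np.ndarray:
    """Rebuild a dense gradient from the sparse payload on the receiving cluster."""
    dense = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    dense[idx] = vals
    return dense.reshape(shape)

# Toy usage: one compression round between two "clusters".
rng = np.random.default_rng(0)
grad = rng.normal(size=(1024, 1024)).astype(np.float32)
residual = np.zeros_like(grad)                     # error-feedback buffer

to_send = grad + residual                          # add back what was skipped last round
idx, vals = sparsify_topk(to_send, ratio=0.01)
residual = to_send - desparsify(idx, vals, grad.shape)  # keep the untransmitted remainder

payload_bytes = idx.nbytes + vals.nbytes
print(f"compressed payload: {payload_bytes / grad.nbytes:.2%} of the dense gradient")
```

With a 1% keep ratio, the transmitted payload is a few percent of the dense gradient's size, which is the kind of reduction that makes periodic cross-cluster synchronization feasible over WAN links; the actual compression and scheduling strategies used by Nebula-I are described in the paper itself.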