Paper Title
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
Paper Authors
Paper Abstract
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network. Furthermore, existing MoE works do not consider computational constraints (e.g., FLOPs, latency) to guide their design. To this end, we develop AutoMoE -- a framework for designing heterogeneous MoE's under computational constraints. AutoMoE leverages Neural Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with 4x inference speedup (CPU) and FLOPs reduction over manually designed Transformers, with parity in BLEU score over dense Transformer and within 1 BLEU point of MoE SwitchTransformer, on aggregate over benchmark datasets for NMT. Heterogeneous search space with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should be their sizes?) allows for adaptive compute -- where different amounts of computations are used for different tokens in the input. Adaptivity comes naturally from routing decisions which send tokens to experts of different sizes. AutoMoE code, data, and trained models are available at https://aka.ms/AutoMoE.
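The abstract's notion of adaptive compute comes from routing tokens to experts of different sizes. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the authors' released implementation: a top-1 router dispatches each token to one feed-forward expert, and because the experts have different hidden dimensions, different tokens incur different amounts of computation. The class and argument names (e.g., `HeterogeneousMoEFFN`, `expert_hidden_dims`) are illustrative assumptions.

```python
# Minimal sketch of a heterogeneous MoE feed-forward layer with top-1 routing.
# Assumption: Switch-style gating where the expert output is scaled by the
# router probability; expert hidden sizes may differ, giving adaptive compute.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousMoEFFN(nn.Module):
    def __init__(self, d_model, expert_hidden_dims):
        super().__init__()
        # Each expert is a standard Transformer FFN with its own hidden size,
        # e.g. expert_hidden_dims = [1024, 3072] gives one small and one large expert.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, h),
                nn.ReLU(),
                nn.Linear(h, d_model),
            )
            for h in expert_hidden_dims
        ])
        # Learned router: one logit per expert for each token.
        self.router = nn.Linear(d_model, len(expert_hidden_dims))

    def forward(self, x):
        # x: (num_tokens, d_model); flatten batch and sequence dims beforehand.
        gate_probs = F.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        top_prob, top_idx = gate_probs.max(dim=-1)       # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Tokens sent to a larger expert pass through a wider FFN,
                # so they receive more FLOPs than tokens sent to a smaller one.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = HeterogeneousMoEFFN(d_model=512, expert_hidden_dims=[1024, 3072])
    tokens = torch.randn(16, 512)
    print(layer(tokens).shape)  # torch.Size([16, 512])
```

In this sketch the per-token cost depends entirely on the router's choice, which is the source of the adaptivity described in the abstract; AutoMoE additionally searches over how many such experts to use, where to place them, and what sizes to give them.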