Paper Title

Residual Mixture of Experts

Authors

Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, Lu Yuan

Abstract

Mixture of Experts (MoE) is able to scale up vision transformers effectively. However, it requires prohibitive computation resources to train a large MoE transformer. In this paper, we propose Residual Mixture of Experts (RMoE), an efficient training pipeline for MoE vision transformers on downstream tasks, such as segmentation and detection. RMoE achieves comparable results with the upper-bound MoE training, while only introducing minor additional training cost compared with the lower-bound non-MoE training pipelines. The efficiency is supported by our key observation: the weights of an MoE transformer can be factored into an input-independent core and an input-dependent residual. Compared with the weight core, the weight residual can be efficiently trained with much less computation resource, e.g., finetuning on the downstream data. We show that, compared with the current MoE training pipeline, we get comparable results while saving over 30% training cost. When compared with state-of-the-art non-MoE transformers, such as Swin-T / CvT-13 / Swin-L, we get +1.1 / 0.9 / 1.0 mIoU gain on ADE20K segmentation and +1.4 / 1.6 / 0.6 AP gain on the MS-COCO object detection task with less than 3% additional training cost.
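The key observation above — that an MoE expert's weight can be factored into a shared input-independent core plus a small input-dependent (per-expert) residual, with only the residual finetuned downstream — can be sketched as follows. This is a minimal illustrative sketch in plain Python; the shapes, names (`w_core`, `w_residual`, `expert_forward`), and the additive decomposition are our assumptions for illustration, not the paper's actual implementation.

```python
import random

random.seed(0)
d_in, d_out, n_experts = 4, 4, 2

def rand_matrix(rows, cols, scale=1.0):
    """Random matrix as a list of lists (stand-in for a trained weight)."""
    return [[scale * random.gauss(0, 1) for _ in range(cols)]
            for _ in range(rows)]

# Input-independent weight core: shared by all experts, trained once
# in the expensive upstream stage and then frozen.
w_core = rand_matrix(d_in, d_out)

# Input-dependent weight residuals: one small correction per expert,
# cheap to finetune on downstream data (hypothetical 0.01 init scale).
w_residual = [rand_matrix(d_in, d_out, scale=0.01) for _ in range(n_experts)]

def expert_weight(expert_id):
    """Effective weight of one expert: shared core + its residual."""
    return [[w_core[i][j] + w_residual[expert_id][i][j]
             for j in range(d_out)] for i in range(d_in)]

def expert_forward(x, expert_id):
    """Apply the routed expert's linear map to an input vector x."""
    w = expert_weight(expert_id)
    return [sum(x[i] * w[i][j] for i in range(d_in))
            for j in range(d_out)]
```

Under this factorization, experts differ only by their residuals, so downstream finetuning needs to update far fewer effective parameters than training every expert's full weight from scratch.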
