Paper Title

On the Sub-Layer Functionalities of Transformer Decoder

Authors

Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, Zhaopeng Tu

Abstract

There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role. During translation, the decoder must predict output tokens by considering both the source-language text from the encoder and the target-language prefix produced in previous steps. In this work, we study how Transformer-based decoders leverage information from the source and target languages -- developing a universal probe task to assess how information is propagated through each module of each decoder layer. We perform extensive experiments on three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis provides insight on when and where decoders leverage different sources. Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance -- a significant reduction in computation and number of parameters, and consequently a significant boost to both training and inference speed.
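
The following is a minimal sketch, not the authors' implementation, of the three sub-layers the abstract refers to in a Transformer decoder layer: masked self-attention over the target-language prefix, cross-attention over the encoder's source representations, and the residual feed-forward module. The `use_ffn` flag illustrates how that feed-forward sub-layer could be dropped, in the spirit of the paper's finding; all names and dimensions here are illustrative assumptions.

```python
# Illustrative PyTorch sketch of a Transformer decoder layer (post-norm style).
# Not the paper's code; hyperparameters and module names are assumptions.
import torch.nn as nn


class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, use_ffn=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.use_ffn = use_ffn
        if use_ffn:
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory, tgt_mask=None):
        # Masked self-attention over the target-language prefix produced so far.
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norm1(tgt + x)
        # Cross-attention over the encoder's source-language representations.
        x, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + x)
        # Residual feed-forward module; skipped when use_ffn is False,
        # mirroring the ablation described in the abstract.
        if self.use_ffn:
            tgt = self.norm3(tgt + self.ffn(tgt))
        return tgt
```

Dropping the feed-forward sub-layer removes the two `d_model x d_ff` projections per layer, which is where the parameter and compute savings mentioned in the abstract would come from.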
