Paper Title
Mixture of Attention Heads: Selecting Attention Heads Per Token
Paper Authors
Paper Abstract
Mixture-of-Experts (MoE) networks have been proposed as an efficient way to scale up model capacity and implement conditional computation. However, the study of MoE components has mostly focused on the feedforward layer in the Transformer architecture. This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head attention with the MoE mechanism. MoA includes a set of attention heads, each with its own set of parameters. Given an input, a router dynamically selects a subset of $k$ attention heads per token. This conditional computation scheme allows MoA to achieve stronger performance than the standard multi-head attention layer. Furthermore, the sparsely gated MoA can easily scale up the number of attention heads and the number of parameters while preserving computational efficiency. Beyond the performance improvements, MoA automatically differentiates the utilities of the heads, providing a new perspective for discussing the model's interpretability. We conducted experiments on several important tasks, including Machine Translation and Masked Language Modeling. The experiments show promising results on several tasks against strong baselines that involve large and very deep models.
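The abstract describes the per-token routing only at a high level. The sketch below illustrates the idea of a router selecting $k$ expert attention heads per token and mixing their outputs with the routing gates. It is a minimal PyTorch sketch written for this summary: the class and parameter names, the tensor shapes, and the choice to share key/value projections across expert heads while keeping per-expert query/output projections are assumptions, not the authors' released implementation.

```python
# Minimal sketch of a Mixture-of-Attention-heads (MoA) style layer.
# Assumptions: shared key/value projections, per-expert query/output
# projections, and simple top-k gating without auxiliary losses.
import torch
import torch.nn as nn


class MoASketch(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int, d_head: int):
        super().__init__()
        self.k = k
        # Router: scores every expert attention head for each token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert head has its own query and output projections.
        self.w_q = nn.Parameter(torch.randn(n_experts, d_model, d_head) * 0.02)
        self.w_o = nn.Parameter(torch.randn(n_experts, d_head, d_model) * 0.02)
        # Key/value projections shared across experts (one design option).
        self.w_k = nn.Linear(d_model, d_head)
        self.w_v = nn.Linear(d_model, d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # 1) Route: pick the top-k expert heads per token, renormalize gates.
        logits = self.router(x)                                    # (B, T, E)
        gates, idx = logits.softmax(dim=-1).topk(self.k, dim=-1)   # (B, T, k)
        gates = gates / gates.sum(dim=-1, keepdim=True)
        # 2) Shared keys and values over the whole sequence.
        keys = self.w_k(x)                                         # (B, T, d_head)
        values = self.w_v(x)                                       # (B, T, d_head)
        # 3) Per-token queries through the selected experts' projections.
        wq = self.w_q[idx]                                         # (B, T, k, d_model, d_head)
        q = torch.einsum("btd,btkdh->btkh", x, wq)                 # (B, T, k, d_head)
        # 4) Standard scaled dot-product attention for each selected head.
        scores = torch.einsum("btkh,bsh->btks", q, keys) / keys.shape[-1] ** 0.5
        attn = scores.softmax(dim=-1)
        heads = torch.einsum("btks,bsh->btkh", attn, values)       # (B, T, k, d_head)
        # 5) Project each head's output back and mix with the router gates.
        wo = self.w_o[idx]                                         # (B, T, k, d_head, d_model)
        out = torch.einsum("btkh,btkhd->btkd", heads, wo)          # (B, T, k, d_model)
        return (gates.unsqueeze(-1) * out).sum(dim=2)              # (B, T, d_model)
```

In this sketch the top-k gates are renormalized so the selected heads' outputs form a weighted combination; sparsely gated MoE architectures typically also add an auxiliary load-balancing objective so that routing does not collapse onto a few experts, which is omitted here for brevity.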