Title
DARTFormer: Finding The Best Type Of Attention
Authors
Abstract
Given the wide and ever-growing range of efficient Transformer attention mechanisms, it is important to identify which attention is most effective for a given task. In this work, we are also interested in combining different attention types to build heterogeneous Transformers. We first propose a DARTS-like Neural Architecture Search (NAS) method to find the best attention for a given task; in this setup, all heads use the same attention type (homogeneous models). Our results suggest that NAS is highly effective on this task: it identifies the best attention mechanisms for IMDb byte-level text classification and ListOps. We then extend our framework to search for and build Transformers with multiple different attention types, which we call heterogeneous Transformers. We show that whilst these heterogeneous Transformers outperform the average homogeneous model, they cannot outperform the best one. We explore the reasons why heterogeneous attention makes sense, and why it ultimately fails.
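The DARTS-like search described in the abstract can be pictured as a continuous relaxation over a pool of candidate attention modules, where per-candidate architecture weights are learned jointly with the model and the highest-weighted candidate is kept after search. The sketch below is a minimal illustration of that idea in PyTorch, not the paper's actual implementation; `MixedAttention`, `SoftmaxSelfAttention`, and the two-candidate pool are hypothetical stand-ins for the efficient attention variants the paper searches over.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxSelfAttention(nn.Module):
    """Vanilla softmax self-attention, used here as one candidate type."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.mha(x, x, x)
        return out

class MixedAttention(nn.Module):
    """DARTS-style continuous relaxation over candidate attention types.

    Each candidate must map (batch, seq, dim) -> (batch, seq, dim). The
    architecture parameters `alpha` are trained alongside the model
    weights; the softmax-weighted sum relaxes the discrete choice of
    attention type into a differentiable one.
    """
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # architecture params

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Weighted sum of candidate outputs: the DARTS relaxation.
        return sum(w * cand(x) for w, cand in zip(weights, self.candidates))

    def discretize(self):
        """After the search phase, keep only the highest-weighted candidate."""
        return self.candidates[int(self.alpha.argmax())]

# Toy search step. In full DARTS, alpha is updated on a held-out split
# while the candidate weights are updated on the training split.
dim = 64
layer = MixedAttention([SoftmaxSelfAttention(dim),
                        SoftmaxSelfAttention(dim, heads=8)])
x = torch.randn(2, 16, dim)
loss = layer(x).pow(2).mean()
loss.backward()  # gradients flow into both the candidates and alpha
```

In this reading, a homogeneous model results from applying `discretize()` and using the single winning attention type everywhere, while a heterogeneous model keeps different winners in different heads or layers.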