Paper Title
Convolution-enhanced Evolving Attention Networks
Paper Authors
Paper Abstract
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge; thus, adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, the Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average improvement of 17% over the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention.
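To make the described mechanism concrete, below is a minimal PyTorch-style sketch of one evolving-attention layer, assuming a standard multi-head attention block. It is an illustration of the idea rather than the authors' implementation: the class name `EvolvingAttention`, the mixing weight `alpha`, and the 3x3 convolution over attention maps are assumptions for exposition. The raw attention logits are mixed with the previous layer's attention map through a residual connection and refined by a convolution before the softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EvolvingAttention(nn.Module):
    """Illustrative sketch of one evolving-attention layer: attention logits are
    combined with the previous layer's attention map via a residual connection
    and refined by a 2D convolution over the (head, query, key) maps."""

    def __init__(self, d_model, num_heads, alpha=0.5):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Convolution mixes heads and local token-pair neighborhoods of the attention map.
        self.attn_conv = nn.Conv2d(num_heads, num_heads, kernel_size=3, padding=1)
        self.alpha = alpha  # weight of the previous layer's attention map (assumed hyperparameter)

    def forward(self, x, prev_attn=None):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (B, num_heads, T, d_head).
        q, k, v = (t.view(B, T, self.num_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (B, num_heads, T, T)
        if prev_attn is not None:
            # Residual connection between attention maps of adjacent layers.
            scores = self.alpha * prev_attn + (1 - self.alpha) * scores
        scores = self.attn_conv(scores)  # convolution-based evolution of the attention map
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        # Return both the token representations and the evolved attention logits,
        # so the next layer can consume them as prev_attn.
        return self.out(out), scores
```

A stack of such layers would thread the returned `scores` of one layer into the next layer's `prev_attn`, forming the chain of residual convolutional modules over attention maps described in the abstract.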