Paper Title

WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability

Paper Authors

Yufan Zhuang, Zihan Wang, Fangbo Tao, Jingbo Shang

Paper Abstract

Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that wavelet transform shall be a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between wavelet transform and Transformers. We propose Wavelet Space Attention (WavSpA) that facilitates attention learning in a learnable wavelet coefficient space which replaces the attention in Transformers by (1) applying forward wavelet transform to project the input sequences to multi-resolution bases, (2) conducting attention learning in the wavelet coefficient space, and (3) reconstructing the representation in input space via backward wavelet transform. Extensive experiments on the Long Range Arena demonstrate that learning attention in the wavelet space using either fixed or adaptive wavelets can consistently improve Transformer's performance and also significantly outperform learning in Fourier space. We further show our method can enhance Transformer's reasoning extrapolation capability over distance on the LEGO chain-of-reasoning task.
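The abstract describes a three-step pipeline (forward wavelet transform, attention in coefficient space, inverse transform), and a small sketch may help make that structure concrete. The sketch below is not the authors' implementation: it assumes a single-level orthonormal Haar wavelet as a fixed transform and PyTorch's nn.MultiheadAttention as the inner attention module, whereas WavSpA uses multi-resolution decompositions and either fixed or learnable (adaptive) wavelets.

```python
# Minimal sketch of WavSpA-style attention. Assumptions (not from the paper's
# code): a single-level orthonormal Haar wavelet as the fixed transform, and
# torch.nn.MultiheadAttention as the inner attention. WavSpA itself uses
# multi-resolution and optionally learnable (adaptive) wavelets.
import math

import torch
import torch.nn as nn


def haar_forward(x: torch.Tensor) -> torch.Tensor:
    """Single-level orthonormal Haar transform along the sequence dimension.

    x: (batch, seq_len, dim) with seq_len even.
    Returns (batch, seq_len, dim): approximation coefficients followed by
    detail coefficients.
    """
    even, odd = x[:, 0::2], x[:, 1::2]
    approx = (even + odd) / math.sqrt(2.0)
    detail = (even - odd) / math.sqrt(2.0)
    return torch.cat([approx, detail], dim=1)


def haar_inverse(coeffs: torch.Tensor) -> torch.Tensor:
    """Invert haar_forward, reconstructing the sequence in input space."""
    half = coeffs.shape[1] // 2
    approx, detail = coeffs[:, :half], coeffs[:, half:]
    even = (approx + detail) / math.sqrt(2.0)
    odd = (approx - detail) / math.sqrt(2.0)
    # Interleave even/odd positions back into a single sequence.
    x = torch.stack([even, odd], dim=2)  # (batch, half, 2, dim)
    return x.reshape(coeffs.shape[0], -1, coeffs.shape[2])


class WaveletSpaceAttention(nn.Module):
    """Attention computed in wavelet coefficient space: (1) forward transform,
    (2) attention over coefficients, (3) inverse transform."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        coeffs = haar_forward(x)                         # (1) project to wavelet bases
        attended, _ = self.attn(coeffs, coeffs, coeffs)  # (2) attend in coefficient space
        return haar_inverse(attended)                    # (3) reconstruct in input space


# Toy usage: batch of 2 sequences, length 8, model dimension 16.
layer = WaveletSpaceAttention(dim=16, num_heads=4)
out = layer(torch.randn(2, 8, 16))
print(out.shape)  # torch.Size([2, 8, 16])
```

Because the Haar transform is orthonormal, haar_inverse(haar_forward(x)) reproduces x exactly, so the wrapper changes only the space in which attention operates, not the shape or scale of the representation.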
