Paper Title
On The Computational Complexity of Self-Attention
Paper Authors
Paper Abstract
Transformer architectures have led to remarkable progress in many state-of-the-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time and space complexity is quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear time, at the cost of having an exponential dependence on the polynomial order.
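As a concrete illustration (not code from the paper), the sketch below contrasts standard dot-product self-attention, whose n × n score matrix is the quadratic bottleneck the lower bound concerns, with a Taylor-series linearization in the spirit the abstract describes: `exp(q·k)` is replaced by its degree-p Taylor polynomial, which factors through an explicit feature map of dimension roughly d^p, giving cost linear in the sequence length but exponential in the polynomial order. All function names and the normalization scheme here are our own assumptions.

```python
import numpy as np
from math import factorial

def self_attention(Q, K, V):
    """Standard dot-product self-attention. Forming the n x n score
    matrix Q K^T makes time and memory quadratic in the sequence
    length n -- the cost the paper's SETH lower bound addresses."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)          # (n, n): the quadratic bottleneck
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)      # row-wise softmax
    return w @ V

def poly_features(X, p):
    """Feature map phi with phi(q) . phi(k) = sum_{m<=p} (q.k)^m / m!.
    The feature dimension grows like d^p -- the exponential dependence
    on the polynomial order mentioned in the abstract."""
    n, d = X.shape
    feats, cur = [np.ones((n, 1))], np.ones((n, 1))
    for m in range(1, p + 1):
        # degree-m monomials: m-fold tensor products of the coordinates
        cur = (cur[:, :, None] * X[:, None, :]).reshape(n, -1)
        feats.append(cur / np.sqrt(factorial(m)))
    return np.concatenate(feats, axis=1)

def linear_taylor_attention(Q, K, V, p=2):
    """Attention with exp(q . k) replaced by its degree-p Taylor
    polynomial. Via the feature map, the cost in the sequence length
    is linear (for fixed d and p). Note: for odd p the truncated
    polynomial can be negative, so the 'weights' are only a sketch."""
    phi_q, phi_k = poly_features(Q, p), poly_features(K, p)
    kv = phi_k.T @ V                    # (D, d), computed in O(n) time
    norm = phi_q @ phi_k.sum(axis=0)    # row normalizers, also O(n)
    return (phi_q @ kv) / norm[:, None]

# Toy comparison on a short sequence (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out_exact = self_attention(Q, K, V)
out_taylor = linear_taylor_attention(Q, K, V, p=2)
```

The key point of the linearized variant is associativity: computing `phi_k.T @ V` first costs O(n·D·d) rather than the O(n²) of materializing the full attention matrix, at the price of a feature dimension D that blows up exponentially in p.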