Paper Title
LSG Attention: Extrapolation of pretrained Transformers to long sequences
Paper Authors
Paper Abstract
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however suffer from a prohibitive limitation due to the self-attention mechanism, inducing $O(n^2)$ complexity with regard to sequence length. To address this limitation we introduce the LSG architecture, which relies on Local, Sparse and Global attention. We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents. Interestingly, it can also be used to adapt existing pretrained models to efficiently extrapolate to longer sequences with no additional training. Along with the introduction of the LSG attention mechanism, we propose tools to train new models and adapt existing ones based on this mechanism.
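To make the Local, Sparse and Global components named in the abstract concrete, here is a minimal illustrative sketch (not the authors' implementation) of an attention mask that combines the three patterns. The `window`, `stride` and `n_global` parameters, and the particular strided sparsity pattern, are hypothetical choices made for illustration only.

```python
# Minimal sketch of an LSG-style attention mask: each query attends to a local
# window, a sparse subset of distant positions, and a few global tokens, so the
# number of attended keys per token no longer grows as O(n) (hence no O(n^2) total).
import torch

def lsg_attention_mask(seq_len: int, window: int = 4, stride: int = 8, n_global: int = 2) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True marks an allowed query-key pair."""
    idx = torch.arange(seq_len)
    # Local: each token attends to positions within a fixed window around itself.
    local = (idx[:, None] - idx[None, :]).abs() <= window
    # Sparse: each token additionally attends to every `stride`-th position
    # (one simple sparsity pattern; the paper considers several strategies).
    sparse = (idx[None, :] % stride) == 0
    # Global: a few designated tokens attend to, and are attended by, every position.
    global_rows = idx[:, None] < n_global
    global_cols = idx[None, :] < n_global
    return local | sparse | global_rows | global_cols

# Each row of the mask allows roughly window + seq_len/stride + n_global keys,
# so masked attention cost scales approximately linearly with sequence length.
mask = lsg_attention_mask(seq_len=16)
print(mask.int())
```

In an actual model this mask would be applied to the attention scores before the softmax; the sketch only shows how combining the three patterns keeps the per-token attention budget small.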