Paper Title

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Authors

Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

Abstract

State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.
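
To make the abstract's two main ideas concrete, below is a minimal NumPy sketch: (1) an SSM layer can be evaluated as a long causal convolution via FFTs in O(N log N) time, which is the operation FlashConv further accelerates with fused block FFTs and state passing; and (2) a simplified, single-head H3-style layer that combines a shift SSM and a diagonal SSM with multiplicative interactions (roughly Q * SSM_diag(SSM_shift(K) * V), elementwise) so the layer can recall earlier tokens and compare tokens across the sequence. The function names, weight shapes, and kernel parameterizations here are illustrative assumptions for exposition, not the authors' released implementation.

```python
import numpy as np

def fft_causal_conv(u, k):
    """Causal 1-D convolution of a length-N signal u with a length-N kernel k
    via FFT, costing O(N log N) instead of the O(N^2) of direct convolution.
    This is the convolution view of SSMs that FlashConv accelerates further
    with fused block FFTs and state passing (not shown here)."""
    N = u.shape[-1]
    L = 2 * N  # zero-pad so circular convolution matches the causal (linear) one
    return np.fft.irfft(np.fft.rfft(u, n=L) * np.fft.rfft(k, n=L), n=L)[..., :N]

def h3_layer_simplified(u, Wq, Wk, Wv, shift_kernel, A_diag, B_diag, C_diag):
    """Simplified, single-head H3-style layer:
        out = Q * SSM_diag(SSM_shift(K) * V)   (elementwise products)
    The shift SSM is realized as a short causal filter (it only looks back a
    few tokens) and the diagonal SSM as a long decaying kernel; both are
    applied with the FFT convolution above. Shapes and parameterizations are
    illustrative, not the paper's exact code."""
    N, d = u.shape
    Q, K, V = u @ Wq, u @ Wk, u @ Wv                       # (N, d) each

    # Shift SSM on K: a learned filter over the last few positions.
    k_shift = np.zeros(N)
    k_shift[: len(shift_kernel)] = shift_kernel
    K_shift = np.stack([fft_causal_conv(K[:, j], k_shift) for j in range(d)], axis=1)

    # Diagonal SSM on the gated product: kernel k[t] = sum_i C_i * A_i**t * B_i.
    t = np.arange(N)
    k_diag = (C_diag * B_diag) @ (A_diag[:, None] ** t[None, :])  # (N,)
    gated = K_shift * V
    S = np.stack([fft_causal_conv(gated[:, j], k_diag) for j in range(d)], axis=1)

    return Q * S  # multiplicative interaction lets the layer compare tokens

# Toy usage with random weights (hypothetical sizes: d model dims, m SSM states)
rng = np.random.default_rng(0)
N, d, m = 512, 16, 4
u = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = h3_layer_simplified(
    u, Wq, Wk, Wv,
    shift_kernel=rng.standard_normal(4),
    A_diag=rng.uniform(0.5, 0.99, m),
    B_diag=rng.standard_normal(m),
    C_diag=rng.standard_normal(m),
)
print(out.shape)  # (512, 16)
```

The multiplicative gates are what distinguish this construction from a plain stack of SSMs: the shift SSM lets K from one position be carried forward and multiplied against V at a later position (recall), and the final product with Q performs the across-sequence comparison that the paper's synthetic tasks probe.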
