Paper Title
Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval
Paper Authors
Paper Abstract
Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over $k$NN-LM (Khandelwal et al., 2020) without hurting perplexity. Our code and trained models are available at https://github.com/neulab/retomaton.
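To make the mechanism described in the abstract concrete, the following is a minimal Python sketch, assuming a toy datastore of (hidden state, next token) entries: each entry keeps a pointer to the entry for the next corpus position and a cluster ("state") id, and decoding follows surviving pointers instead of searching whenever it can, falling back to a nearest-neighbor search otherwise. All names here (`Entry`, `knn_search`, `step`, the toy data) are illustrative assumptions and do not correspond to the actual API of the neulab/retomaton repository.

```python
import numpy as np
from collections import defaultdict

class Entry:
    """One datastore record: the LM hidden state for a context (key), the token
    that followed it (value), a pointer to the next position's entry, and the
    id of the cluster ("automaton state") the key was assigned to."""
    def __init__(self, key, value, next_idx, state_id):
        self.key, self.value = key, value
        self.next_idx, self.state_id = next_idx, state_id

def build_state_members(entries):
    """Group entry indices by cluster id; clusters act as the automaton's states."""
    members = defaultdict(list)
    for i, e in enumerate(entries):
        members[e.state_id].append(i)
    return members

def knn_search(keys, query, k):
    """Brute-force nearest-neighbor search (a FAISS index in a real system)."""
    dists = np.linalg.norm(keys - query, axis=1)
    return list(np.argsort(dists)[:k])

def step(entries, keys, state_members, query, pointers, k=4):
    """One decoding step: if any pointers survived from the previous step,
    follow them (cheap); otherwise fall back to a full kNN search (costly).
    The actual method additionally scores retrieved values against the LM
    and prunes pointers whose keys are far from the current query."""
    if pointers:
        # Pointers landing in the same state are pooled, so a handful of
        # surviving pointers still yields a full retrieval set without search.
        retrieved = sorted({j for p in pointers
                            for j in state_members[entries[p].state_id]})
        did_search = False
    else:
        retrieved = knn_search(keys, query, k)
        did_search = True
    values = [entries[i].value for i in retrieved]
    next_pointers = [entries[i].next_idx for i in retrieved
                     if entries[i].next_idx is not None]
    return values, next_pointers, did_search

if __name__ == "__main__":
    # Toy datastore: random vectors stand in for LM hidden states over a
    # 100-token corpus, and random ids stand in for k-means cluster labels.
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(100, 16)).astype(np.float32)
    tokens = rng.integers(0, 50, size=100)
    states = rng.integers(0, 10, size=100)
    entries = [Entry(keys[i], int(tokens[i]),
                     i + 1 if i + 1 < 100 else None, int(states[i]))
               for i in range(100)]
    members = build_state_members(entries)

    pointers = []
    for t in range(5):
        query = rng.normal(size=16).astype(np.float32)
        vals, pointers, searched = step(entries, keys, members, query, pointers)
        print(f"step {t}: searched={searched}, retrieved {len(vals)} neighbors")
```

Pooling pointers by cluster is what turns the flat list of datastore entries into automaton states: many entries share a state, so traversal can continue for several steps before an expensive search is needed again, which is the source of the saved nearest-neighbor lookups the abstract reports.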