Paper Title
Large Language Models Struggle to Learn Long-Tail Knowledge
Paper Authors
Paper Abstract
The Internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code -- all of which may be learned by language models. However, while certain pieces of information are ubiquitous on the web, others appear extremely rarely. In this paper, we study the relationship between the knowledge memorized by large language models and the information in pre-training datasets scraped from the web. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval-augmentation can reduce the dependence on relevant pre-training information, presenting a promising approach for capturing the long-tail.
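To make the document-counting step concrete, below is a minimal Python sketch of the idea, not the authors' implementation: it assumes an entity linker has already mapped each pre-training document and each question-answer pair to sets of entity IDs, and it counts the documents that mention at least one question entity together with at least one answer entity. The function name, the use of string entity IDs, and the toy data are illustrative assumptions.

```python
from typing import Iterable, Set


def count_relevant_docs(question_entities: Set[str],
                        answer_entities: Set[str],
                        doc_entity_sets: Iterable[Set[str]]) -> int:
    """Count documents whose linked entities overlap both the question's and
    the answer's entities -- one plausible reading of the "relevant document
    count" described in the abstract; the actual pipeline may differ."""
    count = 0
    for doc_entities in doc_entity_sets:
        if doc_entities & question_entities and doc_entities & answer_entities:
            count += 1
    return count


# Toy usage with made-up entity IDs (purely illustrative):
docs = [
    {"E_person_1", "E_city_3"},   # mentions a question entity only
    {"E_person_1", "E_date_7"},   # mentions question and answer entities
    {"E_film_9"},                 # unrelated document
]
print(count_relevant_docs({"E_person_1"}, {"E_date_7"}, docs))  # -> 1
```

Given such counts for every question in an evaluation set, the correlation reported in the paper amounts to grouping questions by relevant-document count and comparing QA accuracy across the groups.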