Paper Title
Will we run out of data? Limits of LLM scaling based on human-generated data
Paper Authors
Paper Abstract
We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.
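As a rough illustration of the extrapolation described in the abstract, the sketch below projects exponential growth in training dataset size against a fixed stock of public text and solves for the crossing year. All numbers here (starting dataset size, annual growth factor, total stock of tokens) are illustrative assumptions for this sketch, not the paper's actual estimates or methodology.

```python
import math

# Illustrative assumptions only; the paper's forecast uses its own
# estimates of dataset growth and the stock of public human text.
largest_dataset_tokens_2024 = 1.5e13   # hypothetical largest training set in 2024 (tokens)
annual_growth_factor = 2.5             # hypothetical yearly growth in dataset size
stock_of_public_text_tokens = 3e14     # hypothetical total stock of public text (tokens)

# Dataset size after t years: D(t) = D_0 * g^t.
# The crossing year solves D_0 * g^t = S, i.e. t = log(S / D_0) / log(g).
years_until_crossing = (
    math.log(stock_of_public_text_tokens / largest_dataset_tokens_2024)
    / math.log(annual_growth_factor)
)

print(f"Projected crossing year: ~{2024 + years_until_crossing:.1f}")
# With these illustrative numbers the crossing lands in the late 2020s,
# broadly in the spirit of the 2026-2032 range stated in the abstract.
```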