Title
Benchmarking Long-tail Generalization with Likelihood Splits
Authors
Abstract
In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on Spider, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BoolQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
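The splitting mechanism described above is simple enough to sketch in a few lines. The following is a rough, hypothetical illustration (not the authors' released code) of scoring utterances with GPT-2 via Hugging Face `transformers` and sending the least likely fraction to the test set; the paper's actual scoring model, length normalization, and split construction may differ.

```python
# Minimal sketch of a Likelihood Split, assuming GPT-2 as the scoring LM.
# The exact scoring setup and split sizes in the paper may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the pre-trained LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean negative log-likelihood per token, so negate it.
    return -out.loss.item()

def likelihood_split(examples, test_fraction=0.2):
    """Put the least likely examples in the test set and the rest in train."""
    scored = sorted(examples, key=log_likelihood)  # ascending likelihood
    n_test = int(len(examples) * test_fraction)
    return scored[n_test:], scored[:n_test]  # (train, test)

# Toy usage: the rarer, more convoluted utterance should land in the test set.
utterances = [
    "What is the capital of France?",
    "Enumerate the aggregate fiscal outlay per quinquennium.",
    "How many people live in Tokyo?",
]
train, test = likelihood_split(utterances, test_fraction=1 / 3)
print("train:", train)
print("test:", test)
```

In practice the same idea applies to any dataset: score every example once with the frozen LM, then threshold or sample by likelihood so that the tail of the input distribution ends up in the evaluation split.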