Title
Benchmarking Long-tail Generalization with Likelihood Splits
Authors
Abstract
In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on Spider, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BoolQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
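The splitting mechanism described above is simple enough to sketch in a few lines. The following is a rough, hypothetical illustration (not the authors' released code) of scoring utterances with GPT-2 via Hugging Face `transformers` and sending the least likely fraction to the test set; the paper's actual scoring model, length normalization, and split construction may differ.

```python
# Minimal sketch of a Likelihood Split, assuming GPT-2 as the scoring LM.
# The exact scoring setup and split sizes in the paper may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the pre-trained LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean negative log-likelihood per token, so negate it.
    return -out.loss.item()

def likelihood_split(examples, test_fraction=0.2):
    """Put the least likely examples in the test set and the rest in train."""
    scored = sorted(examples, key=log_likelihood)  # ascending likelihood
    n_test = int(len(examples) * test_fraction)
    return scored[n_test:], scored[:n_test]  # (train, test)

# Toy usage: the rarer, more convoluted utterance should land in the test set.
utterances = [
    "What is the capital of France?",
    "Enumerate the aggregate fiscal outlay per quinquennium.",
    "How many people live in Tokyo?",
]
train, test = likelihood_split(utterances, test_fraction=1 / 3)
print("train:", train)
print("test:", test)
```

In practice the same idea applies to any dataset: score every example once with the frozen LM, then threshold or sample by likelihood so that the tail of the input distribution ends up in the evaluation split.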