Paper Title
Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding
Paper Authors
Paper Abstract
In the age of large transformer language models, linguistic evaluation plays an important role in diagnosing models' abilities and limitations in natural language understanding. However, current evaluation methods have significant shortcomings. In particular, they do not provide insight into how well a language model captures the distinct linguistic skills essential for language understanding and reasoning. Thus, they fail to effectively map out the aspects of language understanding that remain challenging for existing models, which makes it hard to discover potential limitations in models and datasets. In this paper, we introduce Curriculum, a new format of NLI benchmark for the evaluation of broad-coverage linguistic phenomena. Curriculum contains a collection of datasets covering 36 major types of linguistic phenomena, along with an evaluation procedure for diagnosing how well a language model captures reasoning skills for distinct types of linguistic phenomena. We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality. In addition, our experiments provide insight into the limitations of existing benchmark datasets and state-of-the-art models, which may encourage future research on re-designing datasets, model architectures, and learning objectives.