Paper Title
STARC: Structured Annotations for Reading Comprehension
Paper Authors
Paper Abstract
We present STARC (Structured Annotations for Reading Comprehension), a new annotation framework for assessing reading comprehension with multiple choice questions. Our framework introduces a principled structure for the answer choices and ties them to textual span annotations. The framework is implemented in OneStopQA, a new high-quality dataset for evaluation and analysis of reading comprehension in English. We use this dataset to demonstrate that STARC can be leveraged for a key new application for the development of SAT-like reading comprehension materials: automatic annotation quality probing via span ablation experiments. We further show that it enables in-depth analyses and comparisons between machine and human reading comprehension behavior, including error distributions and guessing ability. Our experiments also reveal that the standard multiple choice dataset in NLP, RACE, is limited in its ability to measure reading comprehension. 47% of its questions can be guessed by machines without accessing the passage, and 18% are unanimously judged by humans as not having a unique correct answer. OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.
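The span-ablation probe mentioned in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: `Item`, `score_choices`, and the toy lexical-overlap heuristic are hypothetical stand-ins, whereas in the paper's setting the scorer would be a trained multiple-choice QA model and the spans would come from the STARC annotations. The idea is that a well-annotated item should be answerable from the full passage, become unanswerable once the annotated critical span is deleted, and not be guessable from the question and choices alone.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Item:
    passage: str
    question: str
    choices: List[str]        # answer options; choices[0] is the correct answer
    span: Tuple[int, int]     # character offsets of the annotated critical span

def ablate_span(passage: str, span: Tuple[int, int]) -> str:
    """Delete the annotated critical span from the passage."""
    start, end = span
    return passage[:start] + passage[end:]

def score_choices(passage: str, question: str, choices: List[str]) -> List[int]:
    """Hypothetical stand-in for a multiple-choice QA model: returns one
    score per choice. Here a toy lexical-overlap heuristic; in practice this
    would be a trained model (e.g., a RoBERTa-based answer ranker)."""
    passage_words = set(passage.lower().split())
    return [len(passage_words & set(choice.lower().split())) for choice in choices]

def is_answerable(scores: List[int]) -> bool:
    """The correct choice (index 0) must strictly outscore every distractor."""
    return scores[0] > max(scores[1:])

def probe(item: Item) -> dict:
    """Annotation quality probe via span ablation, plus a passage-blind
    guessing check on the same item."""
    full = score_choices(item.passage, item.question, item.choices)
    ablated = score_choices(ablate_span(item.passage, item.span),
                            item.question, item.choices)
    no_passage = score_choices("", item.question, item.choices)
    return {
        "answerable_with_span": is_answerable(full),
        "answerable_without_span": is_answerable(ablated),
        "guessable_without_passage": is_answerable(no_passage),
    }

item = Item(
    passage="The bridge opened in 1937 after four years of construction.",
    question="When did the bridge open?",
    choices=["in 1937", "in 1941", "never", "last month"],
    span=(18, 25),  # covers "in 1937"
)
print(probe(item))
# Expected for a sound annotation:
# {'answerable_with_span': True, 'answerable_without_span': False,
#  'guessable_without_passage': False}
```

Note that the same scorer called with an empty passage doubles as the passage-blind guessing probe: the abstract's finding that machines can guess 47% of RACE questions without the passage corresponds to running a trained model in exactly this no-passage regime.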