Paper title
k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations
Paper authors
Paper abstract
Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data: aggregate ratings are more reliable than individual ones. However, many natural language processing (NLP) applications that rely on aggregate ratings only report the reliability of individual ratings, which is the incorrect unit of analysis. In these instances, data reliability is under-reported, and the proposed k-rater reliability (kRR), a multi-rater generalization of inter-rater reliability (IRR), should be used as the correct reliability measure for aggregated datasets. We conducted two replications of the WordSim-353 benchmark and present empirical, analytical, and bootstrap-based methods for computing kRR on WordSim-353; these methods produce very similar results. We hope this discussion will nudge researchers to report kRR in addition to IRR.
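To make the idea concrete, below is a minimal sketch of two of the routes the abstract mentions for estimating kRR, assuming a simple setting not taken from the paper itself: a ratings matrix of shape (items × raters), mean aggregation over k raters, and Pearson correlation as the agreement statistic (the paper's actual IRR statistic, aggregation rule, and procedures may differ). The Spearman-Brown-style formula stands in for the analytical route; the function names and toy data are illustrative.

```python
import numpy as np


def analytical_krr(irr: float, k: int) -> float:
    """Spearman-Brown-style prediction of the reliability of a mean of k
    ratings, given a single-rating reliability (IRR). Illustrative only; the
    paper's analytical derivation may differ."""
    return k * irr / (1 + (k - 1) * irr)


def bootstrap_krr(ratings: np.ndarray, k: int, n_boot: int = 1000, seed: int = 0) -> float:
    """Bootstrap-style estimate of kRR for a ratings matrix (n_items, n_raters).

    Each replicate draws two disjoint groups of k raters, averages each group's
    ratings per item, and correlates the two aggregate vectors; the mean
    correlation over replicates is the estimate.
    """
    rng = np.random.default_rng(seed)
    n_items, n_raters = ratings.shape
    assert 2 * k <= n_raters, "need at least 2k raters to form disjoint groups"
    corrs = []
    for _ in range(n_boot):
        raters = rng.permutation(n_raters)[: 2 * k]
        agg_a = ratings[:, raters[:k]].mean(axis=1)
        agg_b = ratings[:, raters[k:]].mean(axis=1)
        corrs.append(np.corrcoef(agg_a, agg_b)[0, 1])
    return float(np.mean(corrs))


if __name__ == "__main__":
    # Toy data (hypothetical): 353 items rated by 20 raters, each rating a
    # latent item score plus independent rater noise.
    rng = np.random.default_rng(42)
    latent = rng.uniform(0, 10, size=(353, 1))
    ratings = latent + rng.normal(0, 2.0, size=(353, 20))

    # Single-rating reliability (IRR), here the mean pairwise rater correlation.
    pair_corr = np.corrcoef(ratings.T)
    irr = pair_corr[np.triu_indices_from(pair_corr, k=1)].mean()

    k = 5
    print(f"IRR            ≈ {irr:.3f}")
    print(f"analytical kRR ≈ {analytical_krr(irr, k):.3f}")
    print(f"bootstrap  kRR ≈ {bootstrap_krr(ratings, k):.3f}")
```

On data like this, both estimates of kRR land well above the single-rating IRR, which is the abstract's central point: reporting IRR alone understates the reliability of a dataset built from aggregated ratings.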