Paper Title
Text Characterization Toolkit
Paper Authors
Paper Abstract
In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that (especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations) deeper results analysis should become the de facto standard when presenting new models or benchmarks. We present a tool that researchers can use to study properties of a dataset and the influence of those properties on their models' behaviour. Our Text Characterization Toolkit includes both an easy-to-use annotation tool and off-the-shelf scripts that can be used for specific analyses. We also present use cases from three different domains: we use the tool to predict which examples are difficult for well-known trained models, and to identify (potentially harmful) biases and heuristics that are present in a dataset.
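To make the described workflow concrete, below is a minimal, hypothetical sketch (not the toolkit's actual API; the data and column names are invented for illustration) of the kind of analysis the abstract refers to: compute simple per-example text characteristics and check how they relate to a model's per-example performance.

```python
# Hypothetical sketch of characteristic-based results analysis, assuming a toy
# DataFrame of benchmark examples with a per-example model correctness flag.
import pandas as pd
from scipy.stats import pearsonr

examples = pd.DataFrame({
    "text": [
        "A short sentence.",
        "A considerably longer sentence with several rare, domain-specific terms.",
        "Another brief one.",
    ],
    "model_correct": [1, 0, 1],
})

# Two simple characteristics; the actual toolkit ships a broader set of
# ready-made text metrics for this purpose.
examples["num_words"] = examples["text"].str.split().str.len()
examples["avg_word_length"] = examples["text"].apply(
    lambda t: sum(len(w) for w in t.split()) / len(t.split())
)

# Correlate each characteristic with model correctness to spot properties that
# predict difficult examples (or hint at biases the model may be exploiting).
for col in ["num_words", "avg_word_length"]:
    r, p = pearsonr(examples[col], examples["model_correct"])
    print(f"{col}: r={r:.2f}, p={p:.2f}")
```

In practice such correlations would be computed over the full benchmark, and characteristics that strongly predict model errors would then be examined as candidate dataset biases or sources of difficulty.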