The Dutch Draw: Constructing a Universal Baseline for Binary Prediction Models

Paper title

The Dutch Draw: Constructing a Universal Baseline for Binary Prediction Models

Authors

Etienne van de Bijl, Jan Klein, Joris Pries, Sandjai Bhulai, Mark Hoogendoorn, Rob van der Mei

Abstract

Novel prediction methods should always be compared to a baseline to know how well they perform. Without this frame of reference, the performance score of a model is basically meaningless. What does it mean when a model achieves an $F_1$ of 0.8 on a test set? A proper baseline is needed to evaluate the 'goodness' of a performance score. Comparing with the latest state-of-the-art model is usually insightful. However, what counts as state-of-the-art can change rapidly when newer models are developed. Instead of an advanced model, a simple dummy classifier could be used. However, the latter could be beaten too easily, making the comparison less valuable. This paper presents a universal baseline method for all binary classification models, named the Dutch Draw (DD). This approach weighs simple classifiers and determines the best classifier to use as a baseline. We theoretically derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. Summarizing, the DD baseline is: (1) general, as it is applicable to all binary classification problems; (2) simple, as it is quickly determined without training or parameter-tuning; (3) informative, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes: first, to enable comparisons across research papers through a robust and universal baseline; second, to provide a sanity check during the development of a prediction model. It is a major warning sign when a model is outperformed by the DD baseline.
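To make the idea of "weighing simple classifiers" concrete, here is a minimal Python sketch. It rests on an assumption the abstract does not spell out: that the simple classifiers are those labelling a fixed number `k` of the `total` samples as positive, chosen uniformly at random. Under that assumption, $F_1 = 2\,TP / (P + k)$ is linear in the number of true positives, and $TP$ has mean $kP/\text{total}$, so the expected $F_1$ has a closed form. The function names `expected_f1` and `best_random_baseline` are hypothetical, not taken from the paper or any library.

```python
from fractions import Fraction


def expected_f1(k: int, positives: int, total: int) -> Fraction:
    """Expected F1 when k of `total` samples are labelled positive at random.

    Assumption: the k samples are drawn uniformly without replacement, so the
    number of true positives is hypergeometric with mean k * positives / total.
    """
    if positives + k == 0:
        # F1 is undefined here; treating it as 0 is a convention of this sketch.
        return Fraction(0)
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (positives + k), linear in TP.
    return Fraction(2 * k * positives, total * (positives + k))


def best_random_baseline(positives: int, total: int) -> tuple[int, Fraction]:
    """Return the (k, expected F1) pair with the highest expected F1."""
    candidates = ((k, expected_f1(k, positives, total)) for k in range(total + 1))
    return max(candidates, key=lambda pair: pair[1])


# Illustrative numbers only: 30 positives in a test set of 100 samples.
k, score = best_random_baseline(positives=30, total=100)
print(k, float(score))
```

With these illustrative numbers the best choice is to label every sample positive, which matches the abstract's remark that the baseline usually reduces to (almost) always predicting zero or one. A model's $F_1$ on such a test set would then be read against this value rather than against zero.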
