A Bayesian Bradley-Terry model to compare multiple ML algorithms on multiple data sets

Title

A Bayesian Bradley-Terry model to compare multiple ML algorithms on multiple data sets

Author

Wainer, Jacques

Abstract

This paper proposes a Bayesian model to compare multiple algorithms on multiple data sets, on any metric. The model is based on the Bradley-Terry model, which counts the number of times one algorithm performs better than another on different data sets. Because of its Bayesian foundations, the Bayesian Bradley-Terry model (BBT) has different characteristics from frequentist approaches to comparing multiple algorithms on multiple data sets, such as the Demsar (2006) tests on mean rank and the Benavoli et al. (2016) multiple pairwise Wilcoxon tests with p-adjustment procedures. In particular, a Bayesian approach allows for more nuanced statements about the algorithms than simply claiming that a difference is or is not statistically significant. Bayesian approaches also make it possible to define when two algorithms are equivalent for practical purposes, that is, a region of practical equivalence (ROPE). Unlike the Bayesian signed rank comparison procedure proposed by Benavoli et al. (2017), our approach can define a ROPE for any metric, since it is based on probability statements and not on differences of that metric. This paper also proposes a local ROPE concept, which evaluates, based on effect sizes, whether a positive difference between one algorithm's mean measure across some cross-validation and the mean of some other algorithm should really be seen as the first algorithm being better than the second. This local ROPE proposal is independent of a Bayesian use and can be applied in frequentist approaches based on ranks. An R package and a Python program that implement the BBT are available.
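To make the win-counting idea in the abstract concrete, the following is a minimal sketch of a plain (non-Bayesian) Bradley-Terry fit. It is not the BBT model or the author's R or Python code: it uses standard maximum-likelihood MM updates instead of a posterior, drops ties, and has no ROPE. The function names `count_wins`, `fit_bradley_terry`, and `prob_better`, as well as the scores, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def count_wins(scores):
    """Count pairwise wins across data sets.

    scores maps an algorithm name to a list of metric values, one per
    data set, where higher is assumed to be better.
    Returns wins[a][b] = number of data sets on which a beat b.
    """
    algos = list(scores)
    wins = {a: {b: 0 for b in algos if b != a} for a in algos}
    for a, b in combinations(algos, 2):
        for score_a, score_b in zip(scores[a], scores[b]):
            if score_a > score_b:
                wins[a][b] += 1
            elif score_b > score_a:
                wins[b][a] += 1
            # Ties are dropped in this sketch (no ROPE handling).
    return wins


def fit_bradley_terry(wins, iterations=200):
    """Maximum-likelihood Bradley-Terry strengths via MM updates.

    Strengths are normalised to sum to 1. Assumes every algorithm has
    at least one win and at least one comparison.
    """
    algos = list(wins)
    strength = {a: 1.0 / len(algos) for a in algos}
    for _ in range(iterations):
        updated = {}
        for a in algos:
            total_wins = sum(wins[a].values())
            denom = sum(
                (wins[a][b] + wins[b][a]) / (strength[a] + strength[b])
                for b in wins[a]
            )
            updated[a] = total_wins / denom
        norm = sum(updated.values())
        strength = {a: s / norm for a, s in updated.items()}
    return strength


def prob_better(strength, a, b):
    """Bradley-Terry probability that a beats b on a new data set."""
    return strength[a] / (strength[a] + strength[b])


# Made-up accuracy values for three algorithms on four data sets.
scores = {
    "A": [0.90, 0.80, 0.85, 0.70],
    "B": [0.88, 0.82, 0.80, 0.65],
    "C": [0.70, 0.75, 0.81, 0.60],
}
wins = count_wins(scores)
strength = fit_bradley_terry(wins)
print(wins)
print({a: round(s, 3) for a, s in strength.items()})
print(round(prob_better(strength, "A", "B"), 3))
```

A Bayesian treatment along the lines the abstract describes would place a prior on the strengths and report posterior probabilities for each pair rather than a single point estimate; that part is outside this sketch.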
