Paper title
Trees, forests, and impurity-based variable importance
Paper authors
Paper abstract
Tree ensemble methods such as random forests [Breiman, 2001] are very popular for handling high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedure may not be reasonable, since enlightened decisions require an in-depth understanding of the algorithm's prediction process. Unfortunately, random forests are not intrinsically interpretable, since their predictions result from averaging several hundred decision trees. A classic approach to gaining knowledge about this so-called black-box algorithm is to compute variable importances, which are used to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a major role in data analysis. Nevertheless, there is no justification for using random forest variable importances in such a way: we do not even know what these quantities estimate. In this paper, we analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI). We prove that if the input variables are independent and there are no interactions, MDI provides a variance decomposition of the output, in which the contribution of each variable is clearly identified. We also study models exhibiting dependence between input variables or interactions, for which variable importance is intrinsically ill-defined. Our analysis shows that there may be some benefits to using a forest rather than a single tree.
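The variance-decomposition claim can be illustrated numerically. The sketch below is not the paper's procedure: it grows a single depth-limited CART regression tree (a simplification, not a forest) on a toy additive model with independent uniform inputs, and accumulates the weighted impurity decrease per variable. The toy model Y = 2·X1 + X2 + noise, the irrelevant third input, and the helper names `best_split` and `tree_mdi` are all assumptions made for illustration. Under this toy model, the variance shares are Var(2·X1) = 4/12 and Var(X2) = 1/12, so the MDI values would be expected to land near those figures, with little importance assigned to the irrelevant input.

```python
import random


def best_split(X, y, idx):
    """Find the CART split of node `idx` that most reduces the sum of squared errors.

    Returns (sse_decrease, feature, threshold); feature is None if no split helps.
    """
    n = len(idx)
    total = sum(y[i] for i in idx)
    total_sq = sum(y[i] ** 2 for i in idx)
    parent_sse = total_sq - total ** 2 / n
    best = (1e-12, None, None)  # require a strictly positive decrease
    for j in range(len(X[0])):
        order = sorted(idx, key=lambda i: X[i][j])
        left_sum = left_sq = 0.0
        for k in range(1, n):
            y_prev = y[order[k - 1]]
            left_sum += y_prev
            left_sq += y_prev ** 2
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:  # cannot split between equal values
                continue
            sse_left = left_sq - left_sum ** 2 / k
            sse_right = (total_sq - left_sq) - (total - left_sum) ** 2 / (n - k)
            decrease = parent_sse - sse_left - sse_right
            if decrease > best[0]:
                best = (decrease, j, lo)
    return best


def tree_mdi(X, y, max_depth=5):
    """Mean Decrease Impurity of one regression tree.

    For each split, the variance decrease weighted by the fraction of samples
    in the node equals (SSE decrease) / n_total; it is credited to the split variable.
    """
    n_total = len(y)
    mdi = [0.0] * len(X[0])
    stack = [(list(range(n_total)), 0)]
    while stack:
        idx, depth = stack.pop()
        if depth >= max_depth or len(idx) < 2:
            continue
        decrease, j, thr = best_split(X, y, idx)
        if j is None:
            continue
        mdi[j] += decrease / n_total
        stack.append(([i for i in idx if X[i][j] <= thr], depth + 1))
        stack.append(([i for i in idx if X[i][j] > thr], depth + 1))
    return mdi


# Toy data (assumption): independent uniform inputs, additive model, no interactions.
random.seed(0)
n = 600
X = [[random.random() for _ in range(3)] for _ in range(n)]
y = [2.0 * x[0] + x[1] + random.gauss(0.0, 0.1) for x in X]

mdi = tree_mdi(X, y, max_depth=5)
mean_y = sum(y) / n
var_y = sum((v - mean_y) ** 2 for v in y) / n

theory = [4.0 / 12.0, 1.0 / 12.0, 0.0]  # Var(2*X1), Var(X2), irrelevant X3
for j, (m, t) in enumerate(zip(mdi, theory)):
    print(f"X{j + 1}: MDI = {m:.4f}, theoretical variance share = {t:.4f}")
print(f"sum of MDI = {sum(mdi):.4f}, empirical Var(Y) = {var_y:.4f}")
```

Because the tree is depth-limited, the MDI values sum to at most the empirical variance of Y; the gap is the variance left unexplained within the leaves. Adding dependence between inputs or an interaction term to the toy model is a simple way to explore the cases the abstract describes as ill-defined.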