Paper Title

On the utility of feature selection in building two-tier decision trees

Authors

Saltykov, Sergey A.

Abstract

Nowadays, feature selection is frequently used in machine learning when there is a risk of performance degradation due to overfitting or when computational resources are limited. During the feature selection process, the subset of features that is most relevant and least redundant is chosen. In recent years, it has become clear that, in addition to relevance and redundancy, the complementarity of features must be considered. Informally, features are complementary if each is a weak predictor of the target variable separately but they form a strong predictor when combined. This paper demonstrates that the synergistic effect of complementary features mutually amplifying each other in the construction of two-tier decision trees can be interfered with by another feature, resulting in a decrease in performance. Using cross-validation on both synthetic and real datasets, for regression and classification alike, it is shown that eliminating the interfering feature can improve performance by up to 24 times. It has also been found that the less well the domain has been learned, the greater the performance gain. More formally, a statistically significant negative rank correlation is demonstrated between performance on the dataset before elimination of the interfering feature and the performance growth after that elimination. It is concluded that this broadens the scope of feature selection methods to cases where data and computational resources are sufficient.
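To make the interference effect concrete, here is a minimal sketch (not the paper's code; the synthetic XOR-style dataset, the 35% noise level, and all variable names are invented for illustration). Two binary features are individually uninformative about an XOR-like target but perfectly predictive together, i.e. complementary; a third feature that is moderately relevant on its own captures the root split of a depth-2 ("two-tier") decision tree and blocks the synergy:

```python
# Illustrative sketch of the interference effect described in the abstract.
# Assumptions: synthetic XOR data; scikit-learn's CART trees stand in for
# the paper's two-tier decision trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
x1 = rng.integers(0, 2, n)               # weak predictor on its own
x2 = rng.integers(0, 2, n)               # weak predictor on its own
y = np.logical_xor(x1, x2).astype(int)   # strong only when x1 and x2 combine
flip = rng.random(n) < 0.35
x3 = np.where(flip, 1 - y, y)            # interfering feature: ~65% accurate alone

X_all = np.column_stack([x1, x2, x3])    # before feature selection
X_sel = np.column_stack([x1, x2])        # after eliminating the interfering feature

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # two-tier tree
print("with interfering feature:", cross_val_score(tree, X_all, y, cv=5).mean())
print("after its elimination:  ", cross_val_score(tree, X_sel, y, cv=5).mean())
```

With all three features present, the tree spends its root split on x3 and cross-validated accuracy stays near the roughly 65% that x3 alone provides; with x3 eliminated, the two tiers split on x1 and then x2 and accuracy approaches 100%. On before/after pairs like these, the paper's negative rank correlation could in principle be checked with a rank test such as scipy.stats.spearmanr, though the actual datasets and protocol are the paper's own.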
