Paper title
Best subset selection is robust against design dependence
Paper authors
Paper abstract
Best subset selection (BSS) is widely known as the holy grail for high-dimensional variable selection. Nevertheless, the notorious NP-hardness of BSS substantially restricts its practical application and also discourages its theoretical development to some extent, particularly in the current era of big data. In this paper, we investigate the variable selection properties of BSS when its target sparsity is greater than or equal to the true sparsity. Our main message is that BSS is robust against design dependence in terms of achieving model consistency and sure screening, and more importantly, that such robustness can be propagated to the near-best subsets that are computationally tangible. Specifically, we introduce an identifiability margin condition that is free of restricted eigenvalues and show that it is sufficient and nearly necessary for BSS to exactly recover the true model. A relaxed version of this condition is also sufficient for BSS to achieve the sure screening property. Moreover, taking optimization error into account, we find that all the established statistical properties for the exact best subset carry over to any near-best subset whose residual sum of squares is close enough to that of the best one. In particular, a two-stage fully corrective iterative hard thresholding (IHT) algorithm can provably find a sparse sure screening subset within a logarithmic number of steps; another round of exact BSS within this set can recover the true model. The simulation studies and real data examples show that IHT yields lower false discovery rates and higher true positive rates than competing approaches including LASSO, SCAD and Sure Independence Screening (SIS), especially under highly correlated designs.
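The screen-then-refine idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names `fully_corrective_iht` and `best_subset_within`, the step size (one over the squared spectral norm of the design), the stopping rule, and the toy data with an independent Gaussian design are all assumptions made for illustration. The first function alternates a gradient step, hard thresholding to the `k` largest entries, and a least-squares refit on the selected support (the "fully corrective" part); the second runs an exhaustive best subset search of size `s` restricted to the screened set.

```python
from itertools import combinations

import numpy as np


def fully_corrective_iht(X, y, k, n_iter=50):
    """Illustrative IHT: gradient step, keep the k largest entries, refit by least squares."""
    p = X.shape[1]
    beta = np.zeros(p)
    # Assumed step size: 1 / (largest singular value of X)^2.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    support = np.array([], dtype=int)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - step * grad
        # Hard thresholding: indices of the k largest entries in absolute value.
        new_support = np.sort(np.argsort(np.abs(z))[-k:])
        # Fully corrective step: least-squares refit on the selected columns.
        coef, *_ = np.linalg.lstsq(X[:, new_support], y, rcond=None)
        beta = np.zeros(p)
        beta[new_support] = coef
        if np.array_equal(new_support, support):
            break  # support has stabilized
        support = new_support
    return new_support, beta


def best_subset_within(X, y, candidates, s):
    """Exhaustive best subset of size s, searched only among the candidate columns."""
    best_rss, best = np.inf, None
    for subset in combinations(candidates, s):
        cols = [int(j) for j in subset]
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
        if rss < best_rss:
            best_rss, best = rss, cols
    return best


# Toy data (assumed for illustration): 3 strong signals among 50 features.
rng = np.random.default_rng(0)
n, p, s = 200, 50, 3
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = [3.0, -3.0, 3.0]
y = X @ beta_true + rng.standard_normal(n)

# Stage 1: screen with a target sparsity larger than the true sparsity.
screened, _ = fully_corrective_iht(X, y, k=2 * s)
# Stage 2: exact best subset search inside the screened set.
recovered = best_subset_within(X, y, screened, s)
```

Searching over all size-`s` subsets of `2 * s` screened columns costs only a handful of least-squares fits, whereas the same search over all `p` columns grows combinatorially; that gap is the practical motivation for screening first.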