Title
$σ$-Ridge: group regularized ridge regression via empirical Bayes noise level cross-validation
Authors
Abstract
Features in predictive models are not exchangeable, yet common supervised models treat them as such. Here we study ridge regression when the analyst can partition the features into $K$ groups based on external side-information. For example, in high-throughput biology, features may represent gene expression, protein abundance, or clinical data, so each feature group represents a distinct modality. The analyst's goal is to choose optimal regularization parameters $λ = (λ_1, \dotsc, λ_K)$ -- one for each group. In this work, we study the impact of $λ$ on the predictive risk of group-regularized ridge regression by deriving limiting risk formulae under a high-dimensional random effects model with $p \asymp n$ as $n \to \infty$. Furthermore, we propose a data-driven method for choosing $λ$ that attains the optimal asymptotic risk: the key idea is to interpret the residual noise variance $σ^2$ as a regularization parameter to be chosen through cross-validation. An empirical Bayes construction maps the one-dimensional parameter $σ$ to the $K$-dimensional vector of regularization parameters, i.e., $σ \mapsto \widehatλ(σ)$. Beyond its theoretical optimality, the proposed method is practical and runs as fast as cross-validated ridge regression without feature groups ($K=1$).
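The construction described above -- a one-dimensional map $σ \mapsto \widehatλ(σ)$ followed by cross-validation over $σ$ alone -- can be sketched as follows. This is a minimal illustration, not the authors' exact estimator: `sigma_to_lambda` uses a crude moment-based proxy for each group's signal strength $\alpha_k^2$, which is an assumption made here for concreteness.

```python
import numpy as np

def group_ridge(X, y, groups, lam):
    """Group-regularized ridge: min ||y - Xb||^2/n + sum_k lam[k] ||b_k||^2.

    `groups` is an integer array of length p assigning each feature
    to one of K groups; `lam` holds one penalty per group.
    """
    n, p = X.shape
    penalty = np.asarray([lam[g] for g in groups])  # per-feature penalty
    A = X.T @ X / n + np.diag(penalty)
    return np.linalg.solve(A, X.T @ y / n)

def sigma_to_lambda(X, y, groups, sigma, K):
    """Hypothetical empirical-Bayes map sigma -> lambda(sigma).

    Sets lam_k proportional to sigma^2 / alpha_k^2, where alpha_k^2 is
    a rough moment-based estimate of group-k signal strength obtained
    from ||X_k^T y / n||^2 with a noise term subtracted (an assumed,
    simplified stand-in for the paper's estimator).
    """
    n, p = X.shape
    lam = np.empty(K)
    for k in range(K):
        idx = groups == k
        p_k = int(idx.sum())
        moment = np.sum((X[:, idx].T @ y / n) ** 2)
        alpha2 = max(moment - sigma**2 * p_k / n, 1e-8)  # floor at a tiny value
        lam[k] = sigma**2 * p_k / (n * alpha2)
    return lam

def sigma_ridge_cv(X, y, groups, K, sigma_grid, n_folds=5):
    """Cross-validate over the single parameter sigma, as in the abstract."""
    n = X.shape[0]
    folds = np.array_split(np.random.permutation(n), n_folds)
    best_sigma, best_err = None, np.inf
    for sigma in sigma_grid:
        err = 0.0
        for holdout in folds:
            train = np.setdiff1d(np.arange(n), holdout)
            lam = sigma_to_lambda(X[train], y[train], groups, sigma, K)
            b = group_ridge(X[train], y[train], groups, lam)
            err += np.mean((y[holdout] - X[holdout] @ b) ** 2)
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma, sigma_to_lambda(X, y, groups, best_sigma, K)
```

Because the grid search is over the scalar $σ$ rather than a $K$-dimensional grid of penalties, the cost matches ordinary cross-validated ridge regression, as the abstract notes.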