Paper title
RIGID: Robust Linear Regression with Missing Data
Paper authors
Paper abstract
We present a robust framework for linear regression when entries of the features are missing. By assuming an elliptical data distribution, and specifically a multivariate normal model, we can formulate a conditional distribution for the missing entries and build a robust framework that minimizes the worst-case error caused by uncertainty about the missing data. We show that the proposed formulation, which naturally takes into account the dependency between different variables, ultimately reduces to a convex program, for which a customized and scalable solver can be delivered. In addition to a detailed analysis to deliver such a solver, we also analyze the asymptotic behavior of the proposed framework and present technical discussions on estimating the required input parameters. We complement our analysis with experiments on synthetic, semi-synthetic, and real data, and show how the proposed formulation improves prediction accuracy and robustness and outperforms competing techniques.

Missing data is a common problem in many machine learning datasets. With the significant increase in the use of robust optimization techniques to train machine learning models, this paper presents a novel robust regression framework that operates by minimizing the uncertainty associated with missing data. The proposed approach allows models to be trained on incomplete data while minimizing the impact of the uncertainty associated with the unavailable entries. The ideas developed in this paper can be generalized beyond linear models and elliptical data distributions.
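The abstract's starting point, a conditional distribution for the missing entries under a multivariate normal model, can be illustrated with the standard Gaussian conditioning formulas. The sketch below is not the RIGID formulation or its solver; it shows only that textbook step. The function name `conditional_gaussian`, the use of NaN to mark missing entries, and the assumption that the mean `mu` and covariance `sigma` are already known are all choices made for this illustration.

```python
import numpy as np

def conditional_gaussian(x, mu, sigma):
    """Conditional mean and covariance of the missing entries of x.

    Illustrative assumptions: missing entries are marked with NaN, and
    mu / sigma are the known mean vector and covariance matrix of a
    multivariate normal model for the full feature vector.
    """
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    # Partition the covariance into observed / missing blocks.
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]
    # gain = s_mo @ inv(s_oo), computed with a linear solve (s_oo is symmetric).
    gain = np.linalg.solve(s_oo, s_mo.T).T
    cond_mean = mu[miss] + gain @ (x[obs] - mu[obs])
    cond_cov = s_mm - gain @ s_mo.T
    return cond_mean, cond_cov

# Usage: two correlated features, the second one missing.
mu = np.zeros(2)
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
mean, cov = conditional_gaussian(np.array([2.0, np.nan]), mu, sigma)
```

Because the conditional covariance depends on the off-diagonal blocks of `sigma`, the uncertainty about a missing entry shrinks when it is strongly correlated with observed ones, which is the sense in which a formulation built on this distribution accounts for dependency between variables.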