Title
Classification with Strategically Withheld Data
Authors
Abstract
Machine learning techniques can be useful in applications such as credit approval and college admission. However, to be classified more favorably in such contexts, an agent may decide to strategically withhold some of her features, such as bad test scores. This is a missing data problem with a twist: which data is missing {\em depends on the chosen classifier}, because the specific classifier is what may create the incentive to withhold certain feature values. We address the problem of training classifiers that are robust to this behavior. We design three classification methods: {\sc Mincut}, {\sc Hill-Climbing} ({\sc HC}) and Incentive-Compatible Logistic Regression ({\sc IC-LR}). We show that {\sc Mincut} is optimal when the true distribution of data is fully known. However, it can produce complex decision boundaries, and hence be prone to overfitting in some cases. Based on a characterization of truthful classifiers (i.e., those that give no incentive to strategically hide features), we devise a simpler alternative called {\sc HC} which consists of a hierarchical ensemble of out-of-the-box classifiers, trained using a specialized hill-climbing procedure which we show to be convergent. For several reasons, {\sc Mincut} and {\sc HC} are not effective in utilizing a large number of complementarily informative features. To this end, we present {\sc IC-LR}, a modification of Logistic Regression that removes the incentive to strategically drop features. We also show that our algorithms perform well in experiments on real-world data sets, and present insights into their relative performance in different settings.
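The incentive-compatibility idea behind {\sc IC-LR} can be illustrated with a small sketch: if features are encoded so that a withheld feature contributes zero to the score, and all coefficients are constrained to be nonnegative, then revealing a feature can never lower an agent's predicted score, so no agent gains by withholding. The projected-gradient training loop below is an illustrative stand-in, not the authors' exact {\sc IC-LR} procedure, and the toy data is invented for demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_nonneg_logreg(X, y, lr=0.1, steps=2000):
    """Logistic regression trained by projected gradient descent,
    with weights projected onto w >= 0 after every step. Combined
    with a feature encoding where "withheld" maps to 0, nonnegative
    weights remove any incentive to drop a feature."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w = np.maximum(w - lr * grad_w, 0.0)  # projection: keep w >= 0
        b -= lr * grad_b
    return w, b

# Toy data: both features correlate positively with the label.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)
w, b = fit_nonneg_logreg(X, y)
assert np.all(w >= 0)  # the incentive-compatibility constraint holds
```

Because every learned weight is nonnegative, an agent's score with a feature revealed is always at least her score with that feature encoded as withheld (zero), which is the truthfulness property the abstract describes.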