Paper title
Faking feature importance: A cautionary tale on the use of differentially-private synthetic data
Paper authors
Paper abstract
Synthetic datasets are often presented as a silver-bullet solution to the problem of privacy-preserving data publishing. However, for many applications, synthetic data has been shown to have limited utility when used to train predictive models. One promising potential application of these data is in the exploratory phase of the machine learning workflow, which involves understanding, engineering and selecting features. This phase often takes considerable time and depends on the availability of data. There would be substantial value in synthetic data that permitted these steps to be carried out while, for example, data access was being negotiated, or with fewer information governance restrictions. This paper presents an empirical analysis of the agreement between the feature importance obtained from raw and from synthetic data, on a range of artificially generated and real-world datasets (where feature importance represents how useful each feature is when predicting the outcome). We employ two differentially-private methods to produce synthetic data, and apply various utility measures to quantify the agreement in feature importance as this varies with the level of privacy. Our results indicate that synthetic data can sometimes preserve several representations of the ranking of feature importance in simple settings, but their performance is not consistent and depends on a number of factors. Particular caution should be exercised in more nuanced real-world settings, where synthetic data can lead to differences in ranked feature importance that could alter key modelling decisions. This work has important implications for developing synthetic versions of highly sensitive datasets in fields such as finance and healthcare.
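To make the idea of "agreement in feature importance" concrete, the following is a minimal sketch of two ways such agreement could be scored. The abstract does not name the utility measures used, so Spearman rank correlation and top-k overlap are assumptions chosen for illustration, and the importance values are made up rather than taken from the paper.

```python
# Illustrative sketch only: the measures and the numbers below are assumptions,
# not the paper's actual utility measures or results.

def ranks(values):
    """Rank values so that 1 is the most important; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = average_rank
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def top_k_overlap(a, b, k):
    """Fraction of the k most important features shared by both rankings."""
    top_a = set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: -b[i])[:k])
    return len(top_a & top_b) / k


# Hypothetical importance scores for four features (made-up values).
raw_importance = [0.40, 0.30, 0.20, 0.10]    # from a model fitted on raw data
synth_importance = [0.35, 0.15, 0.30, 0.20]  # from a model fitted on synthetic data

print(spearman(raw_importance, synth_importance))
print(top_k_overlap(raw_importance, synth_importance, 2))
```

In this made-up example the most important feature is the same in both rankings, but the second-ranked feature differs, which is the kind of discrepancy that could change a feature selection decision.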