Paper Title
Production of Categorical Data Verifying Differential Privacy: Conception and Applications to Machine Learning
Paper Authors
Paper Abstract
Private and public organizations regularly collect and analyze digitized data about their associates, volunteers, clients, etc. However, because most personal data are sensitive, designing privacy-preserving systems is a key challenge. To tackle privacy concerns, the research community has proposed several approaches, among which differential privacy (DP) stands out as a formal definition that allows the privacy-utility trade-off to be quantified. Moreover, under the local DP (LDP) model, users can sanitize their data locally before transmitting them to the server. The objective of this thesis is thus two-fold: O$_1$) to improve the utility and privacy of multiple frequency estimates under LDP guarantees, a task fundamental to statistical learning; and O$_2$) to assess the privacy-utility trade-off of machine learning (ML) models trained on differentially private data. For O$_1$, we first tackle the problem from two "multiple" perspectives, i.e., multiple attributes and multiple data collections over time, with a focus on utility. Second, we concentrate on the multiple-attributes setting alone, for which we propose a solution that strengthens privacy while preserving utility. In both cases, we demonstrate, through analytical and experimental validation, the advantages of our solutions over state-of-the-art LDP protocols. For O$_2$, we empirically evaluate ML-based solutions designed to solve real-world problems while ensuring DP guarantees. We mainly adopt the input data perturbation setting from the privacy-preserving ML literature, in which each record of the dataset is sanitized independently; we therefore apply LDP algorithms from the perspective of a centralized data owner. In all cases, we conclude that differentially private ML models achieve nearly the same utility as their non-private counterparts.
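For context, a randomized mechanism $\mathcal{M}$ satisfies $\varepsilon$-LDP if, for any two inputs $v, v'$ and any output $y$, $\Pr[\mathcal{M}(v) = y] \le e^{\varepsilon} \Pr[\mathcal{M}(v') = y]$. The sketch below illustrates LDP frequency estimation over categorical data using Generalized Randomized Response (GRR), a canonical single-attribute LDP protocol; it is offered as an illustrative baseline only, not as the specific multi-attribute protocols proposed in the thesis.

```python
import math
import random
from collections import Counter

def grr_sanitize(value, domain, epsilon):
    """Report the true value with probability p = e^eps / (e^eps + k - 1);
    otherwise report a uniformly random *other* value from the domain."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p:
        return value
    return random.choice([v for v in domain if v != value])

def grr_estimate(reports, domain, epsilon):
    """Unbiased frequency estimates: invert the known perturbation
    probabilities, f_hat(v) = (n_v / n - q) / (p - q)."""
    k, n = len(domain), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}

# Toy example: 10,000 users, 4 categories, epsilon = 1.
domain = ["A", "B", "C", "D"]
true_data = random.choices(domain, weights=[0.4, 0.3, 0.2, 0.1], k=10_000)
reports = [grr_sanitize(v, domain, epsilon=1.0) for v in true_data]
print(grr_estimate(reports, domain, epsilon=1.0))
```

Each user only ever sends a perturbed value, so the server never sees raw data; the estimator is unbiased, but its variance grows with the domain size $k$ and with the number of attributes collected, which is precisely the utility bottleneck that motivates the multi-attribute and longitudinal improvements pursued under O$_1$.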