Paper Title
Confidence-Ranked Reconstruction of Census Microdata from Published Statistics
Paper Authors
Paper Abstract
A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
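The core idea of the abstract — running many randomized restarts of a non-convex search for datasets that match the published statistics $Q(D)$, then ranking candidate rows by how often they recur across near-optimal solutions — can be illustrated on a toy instance. The sketch below is not the paper's algorithm; it is a minimal, hypothetical stand-in that uses greedy bit-flip local search over tiny binary datasets and all 2-way marginal counts as the aggregate statistics, just to make the restart-and-rank structure concrete.

```python
import itertools
import random
from collections import Counter

random.seed(0)

# Toy "private" dataset D: 5 rows of 4 binary attributes (hypothetical data).
D = [(1, 0, 1, 0), (1, 0, 1, 0), (0, 1, 1, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
n, k = len(D), len(D[0])

# Published aggregate statistics Q(D): counts of rows with both columns i and j set,
# over all column pairs (a stand-in for the Census-style queries in the paper).
pairs = list(itertools.combinations(range(k), 2))

def Q(rows):
    return [sum(r[i] and r[j] for r in rows) for i, j in pairs]

target = Q(D)

def loss(rows):
    # Squared error between a candidate dataset's statistics and the published ones.
    return sum((a - b) ** 2 for a, b in zip(Q(rows), target))

def random_restart_search(steps=2000):
    # One randomized run: start from a random candidate dataset and greedily
    # flip single bits, accepting moves that do not increase the loss.
    rows = [tuple(random.randint(0, 1) for _ in range(k)) for _ in range(n)]
    cur = loss(rows)
    for _ in range(steps):
        i, j = random.randrange(n), random.randrange(k)
        flipped = list(rows[i])
        flipped[j] ^= 1
        trial = rows[:i] + [tuple(flipped)] + rows[i + 1:]
        t = loss(trial)
        if t <= cur:
            rows, cur = trial, t
    return rows, cur

# Many restarts; keep the solutions achieving the best observed loss.
solutions = [random_restart_search() for _ in range(30)]
best_loss = min(cur for _, cur in solutions)

# Confidence ranking: rows that recur across the best solutions are
# ranked higher, mirroring the abstract's ranking of reconstructed rows.
counts = Counter()
for rows, cur in solutions:
    if cur == best_loss:
        counts.update(rows)

ranking = [row for row, _ in counts.most_common()]
print("best loss:", best_loss)
print("top-ranked candidate rows:", ranking[:3])
```

The ranking step is what distinguishes this from a plain matching attack: a row that appears in most statistics-consistent reconstructions is a higher-confidence guess about the private data than one that appears in only a few.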