Paper title
Applying Data Synthesis for Longitudinal Business Data across Three Countries
Authors
Abstract
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about a business than about any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to give researchers wider access to such data sets. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (LEAP) and Germany (BHP). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.