Permutation-Invariant Tabular Data Synthesis

Paper Title

Permutation-Invariant Tabular Data Synthesis

Authors

Yujin Zhu, Zilong Zhao, Robert Birke, Lydia Y. Chen

Abstract

Tabular data synthesis is an emerging approach to circumventing strict data privacy regulations while still discovering knowledge through big data. Although state-of-the-art AI-based tabular data synthesizers, e.g., table-GAN, CTGAN, TVAE, and CTAB-GAN, are effective at generating synthetic tabular data, their training is sensitive to column permutations of the input data. In this paper, we first conduct an extensive empirical study that exposes this sensitivity to column permutation, together with an in-depth analysis of the existing synthesizers. We show that changing the input column order worsens the statistical difference between real and synthetic data by up to 38.67%, owing to the encoding of tabular data and the network architectures. To fully unleash the potential of big synthetic tabular data, we propose two solutions: (i) AE-GAN, a synthesizer that uses an autoencoder network to represent the tabular data and GAN networks to synthesize the latent representation, and (ii) a feature sorting algorithm that finds a suitable column order of the input data for CNN-based synthesizers. We evaluate the proposed solutions on five datasets in terms of sensitivity to column permutation, quality of the synthetic data, and utility in downstream analyses. Our results show that, compared to the existing synthesizers, the proposed solutions enhance permutation invariance when training synthesizers and further improve the quality and utility of synthetic data by up to 22%.
