TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data

Paper Title

TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data

Authors

Chundawat, Vikram S; Tarun, Ayush K; Mandal, Murari; Lahoti, Mukund; Narang, Pratik

Abstract

Synthetic tabular data generation becomes crucial when real data is limited, expensive to collect, or simply cannot be used due to privacy concerns. However, producing good quality synthetic data is challenging. Several probabilistic, statistical, generative adversarial network (GAN), and variational auto-encoder (VAE) based approaches have been presented for synthetic tabular data generation. Once the data is generated, evaluating its quality is itself quite challenging. Some traditional metrics have been used in the literature, but there is a lack of a common, robust, single metric. This makes it difficult to properly compare the effectiveness of different synthetic tabular data generation methods. In this paper we propose a new universal metric, TabSynDex, for robust evaluation of synthetic data. The proposed metric assesses the similarity of synthetic data to real data through different component scores, which evaluate the characteristics that are desirable for "high quality" synthetic data. Being a single-score metric with an implicit bound, TabSynDex can also be used to observe and evaluate the training of neural network based approaches. This would help in obtaining insights that were not possible earlier. We present several baseline models for comparative analysis of the proposed evaluation metric with existing generative models. We also give a comparative analysis between TabSynDex and existing synthetic tabular data evaluation metrics. This shows the effectiveness and universality of our metric over the existing metrics. Source Code: https://github.com/vikram2000b/tabsyndex
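
The abstract describes a single bounded score built from several component scores, but does not define the components. The sketch below only illustrates that general idea and is not the TabSynDex definition: the two components (`basic_stats_score`, `correlation_score`), their formulas, and the plain average in `composite_score` are all assumptions made for illustration. It expects numeric columns with non-zero variance.

```python
import numpy as np


def basic_stats_score(real: np.ndarray, synth: np.ndarray) -> float:
    """Hypothetical component: 1 minus the mean relative error of the
    per-column means and standard deviations, with each error clipped to [0, 1]."""
    eps = 1e-12
    errors = []
    for stat in (np.mean, np.std):
        r = stat(real, axis=0)
        s = stat(synth, axis=0)
        rel = np.abs(r - s) / (np.abs(r) + eps)
        errors.append(np.clip(rel, 0.0, 1.0))
    return float(1.0 - np.mean(errors))


def correlation_score(real: np.ndarray, synth: np.ndarray) -> float:
    """Hypothetical component: 1 minus the mean absolute difference between
    the two correlation matrices, scaled so the result stays in [0, 1]."""
    r = np.corrcoef(real, rowvar=False)
    s = np.corrcoef(synth, rowvar=False)
    # Correlations lie in [-1, 1], so each absolute difference is at most 2.
    return float(1.0 - np.mean(np.abs(r - s)) / 2.0)


def composite_score(real: np.ndarray, synth: np.ndarray) -> float:
    """Average the bounded components into one score in [0, 1] (assumed aggregation)."""
    parts = [basic_stats_score(real, synth), correlation_score(real, synth)]
    return float(np.mean(parts))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(loc=5.0, scale=1.0, size=(200, 3))
    synth = rng.normal(loc=5.5, scale=1.5, size=(200, 3))
    print("composite score:", composite_score(real, synth))
```

Because every component is bounded, the aggregate is bounded too, which is what makes a score of this shape convenient to log once per epoch while a generative model trains.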
