Paper title
Mixed data Deep Gaussian Mixture Model: A clustering model for mixed datasets
Paper authors
Paper abstract
Clustering mixed data presents numerous challenges inherent to the highly heterogeneous nature of the variables. Despite this heterogeneity, a clustering algorithm should be able to extract discriminant information from the variables in order to form groups. In this work, we introduce a multilayer, model-based clustering method called the Mixed Deep Gaussian Mixture Model (MDGMM), which can be viewed as an automatic way to merge the clusterings performed separately on continuous and non-continuous data. This architecture is flexible and can be adapted to mixed data as well as to purely continuous or purely non-continuous data. In this sense, we generalize both Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data-driven method that selects the best specification of the model and the optimal number of clusters for a given dataset "on the fly". In addition, our model provides continuous low-dimensional representations of the data, which can be a useful tool for visualizing mixed datasets. Finally, we validate the performance of our approach by comparing its results with those of state-of-the-art mixed data clustering models on several commonly used datasets.