论文标题
光谱聚类,贝叶斯跨越森林和森林过程
Spectral Clustering, Bayesian Spanning Forest, and Forest Process
论文作者
论文摘要
光谱聚类将相似性矩阵视为加权图,并通过最大程度地减少示意图损失来分区数据。由于它最小化了群集的相似性,因此无需对每个集群中的分布进行建模。结果,人们减少了模型错误指定的机会,这通常是基于混合模型的聚类的风险。然而,与后者相比,光谱群集没有直接的方法来量化聚类不确定性(例如分配概率),也没有允许用于复杂数据应用程序的简易模型扩展。为了填补这一空白,我们建议贝叶斯森林模型作为光谱聚类的生成图形模型。这是由于我们发现的,即森林模型中的后验连接矩阵几乎具有与归一化频谱聚类使用的相同的领先特征向量。为了诱导森林的分布,我们开发了``森林过程''作为对urn过程的图形扩展,而我们仔细地表征了分区概率的差异。我们得出了一种简单的马尔可夫链蒙特卡洛算法进行后验估计,并且与现有算法相比,表现出卓越的性能。我们说明了几种对数据应用程序有用的基于模型的扩展,包括图像的高维和多视图聚类。
Spectral clustering views the similarity matrix as a weighted graph, and partitions the data by minimizing a graph-cut loss. Since it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster. As a result, one reduces the chance of model misspecification, which is often a risk in mixture model-based clustering. Nevertheless, compared to the latter, spectral clustering has no direct ways of quantifying the clustering uncertainty (such as the assignment probability), or allowing easy model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors, as the ones used by normalized spectral clustering. To induce a distribution for the forest, we develop a ``forest process'' as a graph extension to the urn process, while we carefully characterize the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation, and demonstrate superior performance compared to existing algorithms. We illustrate several model-based extensions useful for data applications, including high-dimensional and multi-view clustering for images.