Paper Title

Systematically and efficiently improving $k$-means initialization by pairwise-nearest-neighbor smoothing

Paper Authors

Baldassi, Carlo

Paper Abstract

We present a meta-method for initializing (seeding) the $k$-means clustering algorithm called PNN-smoothing. It consists in splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that when clustering the individual subsets any seeding algorithm can be used. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. In particular, our method of enhancing $k$-means++ seeding proves superior in both effectiveness and speed compared to the popular "greedy" $k$-means++ variant. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.
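To make the scheme above concrete, here is a minimal Python/NumPy sketch of the PNN-smoothing idea as described in the abstract. It is not the authors' reference implementation (that is the linked Julia package, KMeansPNNSmoothing.jl); the function names such as pnn_smoothing_seed are invented for this illustration, plain (non-greedy) k-means++ is used as the inner seeder, and the merge step uses the standard Ward-style pairwise-nearest-neighbor criterion on the size-weighted candidate centers.

```python
# Illustrative sketch of PNN-smoothing seeding (assumptions noted above).
import numpy as np

def kmeanspp_seed(X, k, rng):
    """Standard (non-greedy) k-means++ seeding on the rows of X."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

def cluster_sizes(X, centers):
    """Sizes of the clusters obtained by hard-assigning X to the centers."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)

def pnn_merge(centers, weights, k):
    """Greedily merge weighted centers pairwise (Ward criterion) down to k."""
    C, w = list(centers), list(weights)
    while len(C) > k:
        best, pair = np.inf, None
        for i in range(len(C)):
            for j in range(i + 1, len(C)):
                # Increase in quantization cost if clusters i and j are merged.
                delta = w[i] * w[j] / (w[i] + w[j]) * ((C[i] - C[j]) ** 2).sum()
                if delta < best:
                    best, pair = delta, (i, j)
        i, j = pair
        merged = (w[i] * C[i] + w[j] * C[j]) / (w[i] + w[j])
        wij = w[i] + w[j]
        for idx in (j, i):          # delete higher index first
            del C[idx]; del w[idx]
        C.append(merged); w.append(wij)
    return np.array(C)

def pnn_smoothing_seed(X, k, J, seed=0):
    """Split X into J random subsets, seed each with k-means++,
    then merge the J*k weighted candidate centers back down to k via PNN."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), J)
    all_centers, all_weights = [], []
    for idx in parts:
        c = kmeanspp_seed(X[idx], k, rng)
        all_centers.append(c)
        all_weights.append(cluster_sizes(X[idx], c))
    return pnn_merge(np.concatenate(all_centers),
                     np.concatenate(all_weights), k)
```

The returned k centers would then be passed to an ordinary Lloyd's k-means loop; the merge step here is quadratic in the J*k candidates, which stays cheap when J is chosen so that J*k is small relative to N.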
