K-BMOM：一种基于自举均值的强大的劳埃德型聚类算法

论文标题

K-BMOM：一种基于自举均值的强大的劳埃德型聚类算法

K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means

论文作者

Brunet-Saumard, Camille, Genetay, Edouard, Saumard, Adrien

论文摘要

我们提出了一种新的聚类算法，该算法对数据集中的异常值的存在非常强大。我们执行劳埃德型迭代，并对质心进行强大的估计。更确切地说，我们基于均值统计数据以估算质心的想法，但可以在构建块时进行更换。我们称这种方法为自举均值（BMOM），并证明，如果通过自举抽样生成足够的块，则比经典均值（MOM）具有更好的分解点的分解点，其中块形成数据集的分区。从聚类的角度来看，BMOM可以占据所需尺寸的许多块，从而避免了某些块中群集的消失，这是基于经典均值的基于分区的块可能发生的陷阱。模拟数据集上的实验表明，所提出的称为K-BMOM的方法比现有的基于K-Means的方法更好。在实践中，提供了调整高参数K-BMOM的准则。还建议从业者使用这种强大的方法来初始化其聚类算法。最后，考虑到我们的估算器的简化和理论版本，我们通过得出K均值扭曲的稳健收敛速率来证明其对对抗性污染的稳健性。据我们所知，这是K-均值扭曲的第一个结果。

We propose a new clustering algorithm that is robust to the presence of outliers in the dataset. We perform Lloyd-type iterations with robust estimates of the centroids. More precisely, we build on the idea of median-of-means statistics to estimate the centroids, but allow for replacement while constructing the blocks. We call this methodology the bootstrap median-of-means (bMOM) and prove that if enough blocks are generated through the bootstrap sampling, then it has a better breakdown point for mean estimation than the classical median-of-means (MOM), where the blocks form a partition of the dataset. From a clustering perspective, bMOM enables to take many blocks of a desired size, thus avoiding possible disappearance of clusters in some blocks, a pitfall that can occur for the partition-based generation of blocks of the classical median-of-means. Experiments on simulated datasets show that the proposed approach, called K-bMOM, performs better than existing robust K-means based methods. Guidelines are provided for tuning the hyper-parameters K-bMOM in practice. It is also recommended to the practitionner to use such a robust approach to initialize their clustering algorithm. Finally, considering a simplified and theoretical version of our estimator, we prove its robustness to adversarial contamination by deriving robust rates of convergence for the K-means distorsion. To our knowledge, it is the first result of this kind for the K-means distorsion.

下载PDF全文

下载文献需遵守相关版权规定

论文标题