论文标题
K-BMOM:一种基于自举均值的强大的劳埃德型聚类算法
K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means
论文作者
论文摘要
我们提出了一种新的聚类算法,该算法对数据集中的异常值的存在非常强大。我们执行劳埃德型迭代,并对质心进行强大的估计。更确切地说,我们基于均值统计数据以估算质心的想法,但可以在构建块时进行更换。我们称这种方法为自举均值(BMOM),并证明,如果通过自举抽样生成足够的块,则比经典均值(MOM)具有更好的分解点的分解点,其中块形成数据集的分区。从聚类的角度来看,BMOM可以占据所需尺寸的许多块,从而避免了某些块中群集的消失,这是基于经典均值的基于分区的块可能发生的陷阱。模拟数据集上的实验表明,所提出的称为K-BMOM的方法比现有的基于K-Means的方法更好。在实践中,提供了调整高参数K-BMOM的准则。还建议从业者使用这种强大的方法来初始化其聚类算法。最后,考虑到我们的估算器的简化和理论版本,我们通过得出K均值扭曲的稳健收敛速率来证明其对对抗性污染的稳健性。据我们所知,这是K-均值扭曲的第一个结果。
We propose a new clustering algorithm that is robust to the presence of outliers in the dataset. We perform Lloyd-type iterations with robust estimates of the centroids. More precisely, we build on the idea of median-of-means statistics to estimate the centroids, but allow for replacement while constructing the blocks. We call this methodology the bootstrap median-of-means (bMOM) and prove that if enough blocks are generated through the bootstrap sampling, then it has a better breakdown point for mean estimation than the classical median-of-means (MOM), where the blocks form a partition of the dataset. From a clustering perspective, bMOM enables to take many blocks of a desired size, thus avoiding possible disappearance of clusters in some blocks, a pitfall that can occur for the partition-based generation of blocks of the classical median-of-means. Experiments on simulated datasets show that the proposed approach, called K-bMOM, performs better than existing robust K-means based methods. Guidelines are provided for tuning the hyper-parameters K-bMOM in practice. It is also recommended to the practitionner to use such a robust approach to initialize their clustering algorithm. Finally, considering a simplified and theoretical version of our estimator, we prove its robustness to adversarial contamination by deriving robust rates of convergence for the K-means distorsion. To our knowledge, it is the first result of this kind for the K-means distorsion.