How to Use K-means for Big Data Clustering?

Paper Title

How to Use K-means for Big Data Clustering?

Authors

Rustam Mussabayev, Nenad Mladenovic, Bassem Jarboui, Ravil Mussabayev

Abstract

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drops drastically when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme for using the K-means and K-means++ algorithms for big data clustering that satisfies the properties of a "true big data" algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solving the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
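The abstract does not spell out the proposed scheme, so the following is only a generic illustration of the data-decomposition idea it mentions, not the authors' algorithm. It is a minimal pure-Python sketch that clusters several small random chunks with K-means++ seeding followed by Lloyd iterations, then keeps the centroids with the lowest sum of squared distances on a fixed evaluation sample. The function names, the `chunk_size` and `n_chunks` parameters, and the evaluation-sample selection rule are all assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """K-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # all points coincide with existing centers
            centers.append(rng.choice(points))
            continue
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        else:  # guard against rounding; should not normally happen
            centers.append(points[-1])
    return centers

def lloyd(points, centers, iters=20):
    """Plain Lloyd iterations: assign to nearest center, recompute means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def sse(points, centers):
    """MSSC objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def decomposed_kmeans(data, k, chunk_size, n_chunks, seed=0):
    """Hypothetical decomposition scheme (an assumption of this sketch):
    cluster independent random chunks and keep the best centroids as
    judged on one fixed evaluation sample."""
    rng = random.Random(seed)
    size = min(chunk_size, len(data))
    eval_sample = rng.sample(data, size)
    best_centers, best_obj = None, float("inf")
    for _ in range(n_chunks):  # chunks are independent, so this loop could run in parallel
        chunk = rng.sample(data, size)
        centers = lloyd(chunk, kmeanspp_init(chunk, k, rng))
        obj = sse(eval_sample, centers)
        if obj < best_obj:
            best_centers, best_obj = centers, obj
    return best_centers, best_obj
```

A call such as `decomposed_kmeans(data, k=2, chunk_size=50, n_chunks=5)` on a list of coordinate tuples returns the selected centroids and their objective value on the evaluation sample. Each chunk touches only `chunk_size` points, which is the sense in which such a scheme economizes on data and time.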
