Paper Title

KNN-DBSCAN: a DBSCAN in high dimensions

Authors

Youguang Chen, William Ruys, George Biros

Abstract

Clustering is a fundamental task in machine learning. One of the most successful and broadly used algorithms is DBSCAN, a density-based clustering algorithm. DBSCAN requires $\epsilon$-nearest neighbor graphs of the input dataset, which are computed with range-search algorithms and spatial data structures like KD-trees. Despite many efforts to design scalable implementations of DBSCAN, existing work is limited to low-dimensional datasets, as constructing $\epsilon$-nearest neighbor graphs is expensive in high dimensions. In this paper, we modify DBSCAN to enable the use of $k$-nearest neighbor graphs of the input dataset. The $k$-nearest neighbor graphs are constructed using approximate algorithms based on randomized projections. Although these algorithms can become inaccurate or expensive in high dimensions, their memory overhead is much lower than that of constructing $\epsilon$-nearest neighbor graphs. We delineate the conditions under which $k$NN-DBSCAN produces the same clustering as DBSCAN. We also present an efficient parallel implementation of the overall algorithm using OpenMP for shared memory and MPI for distributed memory parallelism. We present results on up to 16 billion points in 20 dimensions, and perform weak and strong scaling studies using synthetic data. Our code is efficient in both low and high dimensions. We can cluster one billion points in 3D in less than one second on 28K cores on the Frontera system at the Texas Advanced Computing Center (TACC). In our largest run, we cluster 65 billion points in 20 dimensions in less than 40 seconds using 114,688 x86 cores on TACC's Frontera system. Also, we compare with a state-of-the-art parallel DBSCAN code; on a 20d/4M-point dataset, our code is up to 37$\times$ faster.
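
To make the idea concrete, the sketch below shows single-node, density-based clustering driven by a $k$-nearest neighbor graph instead of an $\epsilon$-range graph. It is only an illustration of the concept, not the authors' MPI/OpenMP implementation: it builds exact neighbors with scikit-learn rather than the approximate, randomized-projection-based kNN described above, and the function name knn_dbscan_sketch and the parameters k and eps are illustrative choices rather than names taken from the paper.

```python
# Minimal sketch: DBSCAN-style clustering on a k-nearest-neighbor graph.
# NOT the authors' parallel kNN-DBSCAN; exact kNN via scikit-learn stands in
# for the paper's approximate, randomized-projection-based kNN construction.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def knn_dbscan_sketch(X, k=10, eps=0.5):
    # k nearest neighbors of every point (column 0 is the point itself, so ask for k+1).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]

    # Density test: a point is "core" if its k-th neighbor lies within eps,
    # i.e. it has at least k neighbors inside the eps-ball.
    core = dist[:, -1] <= eps

    # Keep kNN edges shorter than eps between core points, symmetrize,
    # and take connected components of the core-core graph as clusters.
    n = X.shape[0]
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    keep = (dist.ravel() <= eps) & core[rows] & core[cols]
    G = csr_matrix((np.ones(keep.sum()), (rows[keep], cols[keep])), shape=(n, n))
    G = G.maximum(G.T)
    _, comp = connected_components(G, directed=False)

    # Core points take their component id; a non-core point joins a cluster if
    # one of its kNN neighbors is a nearby core point, otherwise it is noise (-1).
    labels = np.where(core, comp, -1)
    for i in np.where(~core)[0]:
        near_core = (dist[i] <= eps) & core[idx[i]]
        if near_core.any():
            labels[i] = comp[idx[i][near_core][0]]

    # Renumber clusters contiguously, keeping -1 for noise.
    remap = {c: j for j, c in enumerate(np.unique(labels[labels >= 0]))}
    return np.array([remap[l] if l >= 0 else -1 for l in labels])
```

The core-point test (the k-th neighbor lies within eps) mirrors DBSCAN's requirement of having enough points inside the $\epsilon$-ball; the paper delineates the conditions under which clustering the kNN graph in this spirit produces exactly the same clustering as DBSCAN.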
