论文标题

关于基于图的最近邻居搜索的注释

A Note on Graph-Based Nearest Neighbor Search

论文作者

Wang, Hongya, Wang, Zhizheng, Wang, Wei, Xiao, Yingyuan, Zhao, Zeng, Yang, Kaixiang

论文摘要

最近的邻居搜索发现了机器学习,数据挖掘和大量数据处理系统中的许多应用程序。在过去的几年中,基于图的最近的邻居搜索范式的普及,因为它优于太空分门算法。尽管许多实证研究都证明了基于图的算法的效率,但对一个更基本的问题并没有太多关注:为什么基于图基的算法在实践中效果很好?哪些数据属性会影响效率以及如何影响?在本文中,我们尝试回答这些问题。我们的见解是,“点O的邻居倾向于成为KNN图中的邻居的概率”是查询效率的关键数据属性。对于给定的数据集,可以通过KNN图的聚类系数来定性地测量此类属性。为了展示聚类系数如何影响性能,我们确定某些给定查询Q周围的本地连接而不是全球连接性,对回忆具有更直接的影响。具体而言,我们观察到,高聚类系数使Q的最接近Q位于图中的最大连接组件(SCC)中。从算法的角度来看,我们表明搜索过程实际上是由两个阶段组成的 - 最大SCC之外的一个阶段,另一个阶段与广泛接受的单个或多个路径搜索模型不同。我们证明,一旦访问其中的任何点,可以保证常用的基于图的搜索算法可以穿越最大SCC。我们的分析表明,高聚类系数可导致最大SCC的大尺寸,因此在两阶段搜索过程的帮助下提供了良好的答案质量。全面的数据集收集的广泛经验结果验证了我们的发现。

Nearest neighbor search has found numerous applications in machine learning, data mining and massive data processing systems. The past few years have witnessed the popularity of the graph-based nearest neighbor search paradigm because of its superiority over the space-partitioning algorithms. While a lot of empirical studies demonstrate the efficiency of graph-based algorithms, not much attention has been paid to a more fundamental question: why graph-based algorithms work so well in practice? And which data property affects the efficiency and how? In this paper, we try to answer these questions. Our insight is that "the probability that the neighbors of a point o tends to be neighbors in the KNN graph" is a crucial data property for query efficiency. For a given dataset, such a property can be qualitatively measured by clustering coefficient of the KNN graph. To show how clustering coefficient affects the performance, we identify that, instead of the global connectivity, the local connectivity around some given query q has more direct impact on recall. Specifically, we observed that high clustering coefficient makes most of the k nearest neighbors of q sit in a maximum strongly connected component (SCC) in the graph. From the algorithmic point of view, we show that the search procedure is actually composed of two phases - the one outside the maximum SCC and the other one in it, which is different from the widely accepted single or multiple paths search models. We proved that the commonly used graph-based search algorithm is guaranteed to traverse the maximum SCC once visiting any point in it. Our analysis reveals that high clustering coefficient leads to large size of the maximum SCC, and thus provides good answer quality with the help of the two-phase search procedure. Extensive empirical results over a comprehensive collection of datasets validate our findings.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源