论文标题
数据依赖性算法,用于查询地球搬运工的距离,尺寸较低
A Data-Dependent Algorithm for Querying Earth Mover's Distance with Low Doubling Dimensions
论文作者
论文摘要
在本文中,我们考虑以下查询问题:给定两个加权点集$ a $ a $ a $ a $ a $ a和$ b $在欧几里得空间$ \ mathbb {r}^d $中,我们想快速确定他们的地球搬运工(EMD)是否比预先指定的threshold $ t \ geq 0 $大于或小。该问题在机器学习和数据挖掘领域找到了许多重要的应用程序。特别是,我们假设尺寸$ d $不是固定的,而大小$ | a | $和$ | b | $很大。因此,由于它们的复杂性很高,大多数现有的EMD算法对于解决此问题的效率并不高。在这里,我们考虑了一个假设,即$ a $ a和$ b $具有较低的加倍维度,这对于现实世界中的高维数据很常见。受几何方法{\ em Net Tree}的启发,我们提出了一种新颖的``数据依赖性''算法,以避免直接计算$ a $ a $ a和$ b $之间的EMD,以便更有效地解决此查询问题。我们还研究了我们在合成和真实数据集方面的性能。实验结果表明,与现有的EMD算法相比,我们的方法可以节省大量的运行时间。
In this paper, we consider the following query problem: given two weighted point sets $A$ and $B$ in the Euclidean space $\mathbb{R}^d$, we want to quickly determine that whether their earth mover's distance (EMD) is larger or smaller than a pre-specified threshold $T\geq 0$. The problem finds a number of important applications in the fields of machine learning and data mining. In particular, we assume that the dimensionality $d$ is not fixed and the sizes $|A|$ and $|B|$ are large. Therefore, most of existing EMD algorithms are not quite efficient to solve this problem due to their high complexities. Here, we consider the problem under the assumption that $A$ and $B$ have low doubling dimensions, which is common for high-dimensional data in real world. Inspired by the geometric method {\em net tree}, we propose a novel ``data-dependent'' algorithm to avoid directly computing the EMD between $A$ and $B$, so as to solve this query problem more efficiently. We also study the performance of our method on synthetic and real datasets. The experimental results suggest that our method can save a large amount of running time comparing with existing EMD algorithms.