Paper Title

Improving \textit{Tug-of-War} sketch using Control-Variates method

Paper Authors

Rameshwar Pratap, Bhisham Dev Verma, Raghav Kulkarni

Paper Abstract

Computing a space-efficient summary, \textit{a.k.a.} a \textit{sketch}, of large data is a central problem in streaming algorithms. Such sketches are used to answer \textit{post-hoc} queries in several data analytics tasks. Algorithms for computing sketches are typically required to be fast, accurate, and space-efficient. A fundamental problem in the streaming framework is that of computing the frequency moments of a data stream. The $k$-th frequency moment of a sequence containing $f_i$ elements of type $i$, for $i \in [n]$, is the number $\mathbf{F}_k=\sum_{i=1}^n {f_i}^k$, which is the $k$-th power of the $\ell_k$ norm of the frequency vector $(f_1, f_2, \ldots, f_n)$. Another important problem is to compute the similarity between two data streams via the inner product of the corresponding frequency vectors. The seminal work of Alon, Matias, and Szegedy~\cite{AMS}, \textit{a.k.a.} the \textit{Tug-of-War} (or AMS) sketch, gives a randomized sublinear-space (and linear-time) algorithm for computing the frequency moments, as well as the inner product between the two frequency vectors corresponding to a pair of data streams. However, the variance of these estimates typically tends to be large. In this work, we focus on minimizing the variance of these estimates. We use techniques from the classical Control-Variate method~\cite{Lavenberg}, which is primarily known for variance reduction in Monte-Carlo simulations; as a result, we obtain a significant variance reduction at the cost of a small computational overhead. We present a theoretical analysis of our proposal and complement it with supporting experiments on synthetic as well as real-world datasets.
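To make the two ingredients of the abstract concrete, below is a minimal, self-contained Python sketch. It implements the plain Tug-of-War (AMS) estimator of the second frequency moment $\mathbf{F}_2$ (one $\pm 1$-signed counter per copy, squared and averaged), and a generic control-variate correction, demonstrated on the textbook Monte-Carlo task of estimating $\mathbb{E}[e^U]$ for $U \sim \mathrm{Uniform}(0,1)$ with control variate $Z = U$ and known $\mathbb{E}[Z] = 1/2$. The function names, the use of fully random signs instead of 4-wise independent hashing, and this particular choice of control variate are illustrative assumptions; they are not the paper's specific construction for combining the two.

import collections
import math
import random
import statistics

def ams_f2_estimate(stream, num_copies=100, seed=0):
    """Tug-of-War (AMS) estimate of F_2 = sum_i f_i^2.

    Each copy keeps one counter X = sum_i f_i * s(i), where s(i) is a random
    +/-1 sign per distinct item; X^2 is an unbiased estimator of F_2, and
    averaging independent copies reduces its variance.  (Fully random signs
    are used here for simplicity instead of 4-wise independent hashing.)
    """
    rng = random.Random(seed)
    signs = [dict() for _ in range(num_copies)]   # lazily drawn sign s(i) per copy
    counters = [0.0] * num_copies
    for item in stream:
        for j in range(num_copies):
            s = signs[j].setdefault(item, rng.choice((-1, 1)))
            counters[j] += s
    return statistics.mean(x * x for x in counters)

def control_variate_mean(ys, zs, z_mean):
    """Control-variate estimate of E[Y], given samples of a correlated Z with
    known mean z_mean: average of Y - c * (Z - E[Z]), where the coefficient
    c = Cov(Y, Z) / Var(Z) is estimated from the same samples.
    Returns (adjusted mean, sample std of the adjusted values)."""
    y_bar, z_bar = statistics.mean(ys), statistics.mean(zs)
    cov = sum((y - y_bar) * (z - z_bar) for y, z in zip(ys, zs)) / (len(ys) - 1)
    c = cov / statistics.variance(zs)
    adjusted = [y - c * (z - z_mean) for y, z in zip(ys, zs)]
    return statistics.mean(adjusted), statistics.stdev(adjusted)

if __name__ == "__main__":
    rng = random.Random(42)

    # Tug-of-War estimate of F_2 on a synthetic stream of item ids.
    stream = [rng.randrange(50) for _ in range(10_000)]
    exact_f2 = sum(f * f for f in collections.Counter(stream).values())
    print("exact F_2:", exact_f2, "  AMS estimate:", ams_f2_estimate(stream))

    # Textbook control-variate demo: estimate E[e^U], U ~ Uniform(0, 1),
    # using Z = U with known E[Z] = 0.5 as the control variate.
    us = [rng.random() for _ in range(5_000)]
    ys = [math.exp(u) for u in us]
    cv_mean, cv_std = control_variate_mean(ys, us, 0.5)
    print("plain MC:", statistics.mean(ys),
          "  with control variate:", cv_mean, "(std", cv_std, ")",
          "  exact e - 1:", math.e - 1)

Running the script prints the exact $\mathbf{F}_2$ next to the AMS estimate, and the plain versus control-variate Monte-Carlo estimates of $e - 1 \approx 1.718$; the reported standard deviation of the adjusted samples gives a quick check of how much variance the control variate removes.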
