小队：对于每个项目的分位数估计，将草图和采样组合要好

论文标题

小队：对于每个项目的分位数估计，将草图和采样组合要好

SQUAD: Combining Sketching and Sampling Is Better than Either for Per-item Quantile Estimation

论文作者

Shahout, Rana, Friedman, Roy, Basat, Ran Ben

论文摘要

流监视在许多数据流应用程序中都是基本的，例如财务数据跟踪器，安全性，异常检测和负载平衡。在这方面，分位数特别感兴趣，因为它们经常捕获用户的实用程序。例如，如果视频连接具有较高的尾部潜伏期，即使平均潜伏期低，感知的质量也会受到影响。在这项工作中，我们考虑了近似每个项目的问题。流中的元素是（ID，延迟）元组，我们希望跟踪每个ID的延迟分位数。现有的分位数草图是为单个数字流设计的（例如，仅包含延迟）。虽然可以为每个ID分配一个单独的草图实例，但这可能需要不可行的内存。取而代之的是，我们考虑跟踪重型击球手的分位数（最常见的物品），这些分位数通常被认为特别重要，而无需事先知道它们。我们首先提出了一种简单的采样算法，该算法是基准。然后，我们设计了一种算法，该算法在重型击球手算法的每个条目中增加了一个分位数草图，从而产生了相似的空间复杂性，但具有确定性的错误保证。最后，我们提出了小队，这种方法结合了采样和素描，同时提高了渐近空间复杂性。直观地，小队使用背景抽样过程来捕获项目潜伏期的行为，然后再用草图分配，从而使我们可以使用较少的样本和草图。我们的解决方案进行了严格的分析，我们使用广泛的模拟证明了方法的优越性。

Stream monitoring is fundamental in many data stream applications, such as financial data trackers, security, anomaly detection, and load balancing. In that respect, quantiles are of particular interest, as they often capture the user's utility. For example, if a video connection has high tail latency, the perceived quality will suffer, even if the average and median latencies are low. In this work, we consider the problem of approximating the per-item quantiles. Elements in our stream are (ID, latency) tuples, and we wish to track the latency quantiles for each ID. Existing quantile sketches are designed for a single number stream (e.g., containing just the latency). While one could allocate a separate sketch instance for each ID, this may require an infeasible amount of memory. Instead, we consider tracking the quantiles for the heavy hitters (most frequent items), which are often considered particularly important, without knowing them beforehand. We first present a simple sampling algorithm that serves as a benchmark. Then, we design an algorithm that augments a quantile sketch within each entry of a heavy hitter algorithm, resulting in similar space complexity but with a deterministic error guarantee. Finally, we present SQUAD, a method that combines sampling and sketching while improving the asymptotic space complexity. Intuitively, SQUAD uses a background sampling process to capture the behaviour of the latencies of an item before it is allocated with a sketch, thereby allowing us to use fewer samples and sketches. Our solutions are rigorously analyzed, and we demonstrate the superiority of our approach using extensive simulations.

下载PDF全文

下载文献需遵守相关版权规定

论文标题