在滑动窗口中接近最佳重复检测

论文标题

在滑动窗口中接近最佳重复检测

Approaching Optimal Duplicate Detection in a Sliding Window

论文作者

Géraud-Stewart, Rémi, Lombard-Platet, Marius, Naccache, David

论文摘要

重复检测是识别仅当仅可用的内存时（可能是无限的）数据流中以前出现在（可能是无限的）数据流中的问题。不幸的是，无限的流设置不足，而重复检测过滤器的错误率受到严重约束：因此，它们似乎没有任何优势，而不是有偏见的硬币折腾[8]。在本文中，我们将[13,16]引入的滑动窗口设置形式化，并证明可以将完美的解决方案用于最大窗口大小$ W_ \ text {max} $。在此阈值之上，我们表明某些现有的重复检测过滤器（为$ \ textit {non-windowed} $设置设计）的表现要比针对窗口问题的人更好。最后，我们引入了一种“排队结构”，该构造改善了窗户设置中某些重复检测过滤器的性能。我们还在对抗性环境中分析了过滤器的安全性。

Duplicate detection is the problem of identifying whether a given item has previously appeared in a (possibly infinite) stream of data, when only a limited amount of memory is available. Unfortunately the infinite stream setting is ill-posed, and error rates of duplicate detection filters turn out to be heavily constrained: consequently they appear to provide no advantage, asymptotically, over a biased coin toss [8]. In this paper we formalize the sliding window setting introduced by [13,16], and show that a perfect (zero error) solution can be used up to a maximal window size $w_\text{max}$. Above this threshold we show that some existing duplicate detection filters (designed for the $\textit{non-windowed}$ setting) perform better that those targeting the windowed problem. Finally, we introduce a "queuing construction" that improves on the performance of some duplicate detection filters in the windowed setting. We also analyse the security of our filters in an adversarial setting.

下载PDF全文

下载文献需遵守相关版权规定

论文标题