论文标题
在滑动窗口中接近最佳重复检测
Approaching Optimal Duplicate Detection in a Sliding Window
论文作者
论文摘要
重复检测是识别仅当仅可用的内存时(可能是无限的)数据流中以前出现在(可能是无限的)数据流中的问题。 不幸的是,无限的流设置不足,而重复检测过滤器的错误率受到严重约束:因此,它们似乎没有任何优势,而不是有偏见的硬币折腾[8]。 在本文中,我们将[13,16]引入的滑动窗口设置形式化,并证明可以将完美的解决方案用于最大窗口大小$ W_ \ text {max} $。在此阈值之上,我们表明某些现有的重复检测过滤器(为$ \ textit {non-windowed} $设置设计)的表现要比针对窗口问题的人更好。最后,我们引入了一种“排队结构”,该构造改善了窗户设置中某些重复检测过滤器的性能。 我们还在对抗性环境中分析了过滤器的安全性。
Duplicate detection is the problem of identifying whether a given item has previously appeared in a (possibly infinite) stream of data, when only a limited amount of memory is available. Unfortunately the infinite stream setting is ill-posed, and error rates of duplicate detection filters turn out to be heavily constrained: consequently they appear to provide no advantage, asymptotically, over a biased coin toss [8]. In this paper we formalize the sliding window setting introduced by [13,16], and show that a perfect (zero error) solution can be used up to a maximal window size $w_\text{max}$. Above this threshold we show that some existing duplicate detection filters (designed for the $\textit{non-windowed}$ setting) perform better that those targeting the windowed problem. Finally, we introduce a "queuing construction" that improves on the performance of some duplicate detection filters in the windowed setting. We also analyse the security of our filters in an adversarial setting.