论文标题
基于样本的查询的尖锐频率边界
Sharp Frequency Bounds for Sample-Based Queries
论文作者
论文摘要
数据草图算法扫描一个大数据集,收集少量数据 - 该草图可用于从统计上推断大数据集的属性。一些数据草图算法对大数据集进行了固定尺寸的随机样本,并使用该样本来推断符合大数据集中各种标准的项目频率。本文展示了如何有效且准确地推断出统计学上的近似正确(PAC)边界,以至于频率边界的频率界限是尖锐的或仅一个一个频率,这是最佳的结果,而无需精确计算。
A data sketch algorithm scans a big data set, collecting a small amount of data -- the sketch, which can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of a big data set, and use that sample to infer frequencies of items that meet various criteria in the big data set. This paper shows how to statistically infer probably approximately correct (PAC) bounds for those frequencies, efficiently, and precisely enough that the frequency bounds are either sharp or off by only one, which is the best possible result without exact computation.