Paper title
Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data
Paper authors
Paper abstract
The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
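To make the setting concrete, the following is a minimal illustrative sketch, not the authors' methodology. It assumes a single seeded hash function mapping symbols to a fixed number of buckets (the names `hash_bucket`, `sketch`, and `good_turing_missing_mass` are hypothetical) and contrasts what a sketch retains, namely bucket totals, with the classical Good-Turing missing-mass estimate, which needs the empirical frequencies of distinct symbols that sketching discards.

```python
import hashlib
from collections import Counter


def hash_bucket(symbol: str, num_buckets: int, seed: int = 0) -> int:
    """Map a symbol to one of num_buckets buckets using a seeded hash (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{symbol}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sketch(data, num_buckets: int, seed: int = 0):
    """Return per-bucket counts; the symbols themselves are not stored."""
    counts = [0] * num_buckets
    for symbol in data:
        counts[hash_bucket(symbol, num_buckets, seed)] += 1
    return counts


def good_turing_missing_mass(data) -> float:
    """Good-Turing estimate: number of symbols seen exactly once, divided by the sample size."""
    freq = Counter(data)
    singletons = sum(1 for c in freq.values() if c == 1)
    return singletons / len(data)


# Toy data: "a" appears 3 times, "b" twice, and "c", "d", "e" once each.
data = ["a", "b", "a", "c", "d", "a", "b", "e"]

# With full data, the missing mass can be estimated directly: 3 singletons / 8.
missing_mass = good_turing_missing_mass(data)

# After sketching, only bucket totals remain; colliding symbols are merged,
# so the number of singletons can no longer be read off directly.
bucket_counts = sketch(data, num_buckets=4)

print(missing_mass)
print(bucket_counts)
```

Because distinct symbols can collide in the same bucket, the bucket totals alone do not determine how many symbols occurred once, which is why estimation from the sketch requires a model-based approach such as the Bayesian nonparametric one the abstract describes.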