Paper title
Local Two-Sample Testing over Graphs and Point-Clouds by Random-Walk Distributions
Paper authors
Paper abstract
Rejecting the null hypothesis in two-sample testing is a fundamental tool for scientific discovery. Yet, aside from concluding that two samples do not come from the same probability distribution, it is often of interest to characterize how the two distributions differ. Given samples from two densities $f_1$ and $f_0$, we consider the task of localizing occurrences of the inequality $f_1 > f_0$. To avoid the challenges associated with high-dimensional space, we propose a general hypothesis testing framework where hypotheses are formulated adaptively to the data by conditioning on the combined sample from the two densities. We then investigate a special case of this framework where the notion of locality is captured by a random walk on a weighted graph constructed over this combined sample. We derive a tractable testing procedure for this case employing a type of scan statistic, and provide non-asymptotic lower bounds on the power and accuracy of our test to detect whether $f_1>f_0$ in a local sense. Furthermore, we characterize the test's consistency according to a certain problem-hardness parameter, and show that our test achieves the minimax detection rate for this parameter. We conduct numerical experiments to validate our method, and demonstrate our approach on two real-world applications: detecting and localizing arsenic well contamination across the United States, and analyzing two-sample single-cell RNA sequencing data from melanoma patients.
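As a rough illustration of the idea in the abstract (locality captured by a random walk on a graph built over the combined sample, then a scan over local statistics), the following is a minimal sketch and not the paper's actual procedure. Everything specific here is an assumption made for illustration: the function name `random_walk_scan`, the unweighted symmetric k-nearest-neighbor graph, the parameters `k`, `t` and `n_perm`, the standardization of each local statistic, and the permutation calibration of the scan maximum. The paper's own test statistic, graph weights and calibration may differ.

```python
import numpy as np


def random_walk_scan(x0, x1, k=10, t=3, n_perm=200, seed=0):
    """Hypothetical sketch: scan for regions where sample x1 is over-represented.

    x0, x1 : arrays of shape (n0, d) and (n1, d), samples from f_0 and f_1.
    Returns (p_value, index of the most extreme point in the pooled sample,
    vector of local statistics, pooled sample).
    """
    rng = np.random.default_rng(seed)
    x = np.vstack([x0, x1])
    labels = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
    n = len(x)

    # Pairwise squared distances; exclude self-matches from the neighbor search.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)

    # Symmetric kNN adjacency (assumption: unweighted edges for simplicity).
    idx = np.argsort(d2, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)

    # Row-stochastic transition matrix; row i of pt is the distribution of a
    # t-step random walk started at point i.
    p = adj / adj.sum(axis=1, keepdims=True)
    pt = np.linalg.matrix_power(p, t)

    # Local statistic: walk-weighted average of centered labels, scaled by the
    # Euclidean norm of the walk weights so that nodes are comparable.
    centered = labels - labels.mean()
    scale = np.sqrt((pt ** 2).sum(axis=1))

    def local_stats(lab):
        return (pt @ lab) / scale

    local = local_stats(centered)
    observed = local.max()

    # Calibrate the scan maximum by permuting labels over the fixed graph,
    # that is, conditionally on the combined sample.
    null = np.array(
        [local_stats(rng.permutation(centered)).max() for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return p_value, int(local.argmax()), local, x
```

A small p-value suggests that $f_1 > f_0$ somewhere, and the returned index points at the neighborhood in the pooled sample where the excess of $f_1$ samples is most pronounced. The walk length `t` plays the role of a locality scale in this sketch: larger values average labels over wider neighborhoods.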