Hidden Markov Pólya trees for high-dimensional distributions

Paper title

Hidden Markov Pólya trees for high-dimensional distributions

Authors

Awaya, Naoki; Ma, Li

Abstract

The Pólya tree (PT) process is a general-purpose Bayesian nonparametric model that has found wide application in a range of inference problems. It has a simple analytic form and the posterior computation boils down to beta-binomial conjugate updates along a partition tree over the sample space. Recent development in PT models shows that performance of these models can be substantially improved by (i) allowing the partition tree to adapt to the structure of the underlying distributions and (ii) incorporating latent state variables that characterize local features of the underlying distributions. However, important limitations of the PT remain, including (i) the sensitivity in the posterior inference with respect to the choice of the partition tree, and (ii) the lack of scalability with respect to dimensionality of the sample space. We consider a modeling strategy for PT models that incorporates a flexible prior on the partition tree along with latent states with Markov dependency. We introduce a hybrid algorithm combining sequential Monte Carlo (SMC) and recursive message passing for posterior sampling that can scale up to 100 dimensions. While our description of the algorithm assumes a single computer environment, it has the potential to be implemented on distributed systems to further enhance the scalability. Moreover, we investigate the large sample properties of the tree structures and latent states under the posterior model. We carry out extensive numerical experiments in density estimation and two-group comparison, which show that flexible partitioning can substantially improve the performance of PT models in both inference tasks. We demonstrate an application to a mass cytometry data set with 19 dimensions and over 200,000 observations.
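To illustrate the beta-binomial conjugate updates along a partition tree that the abstract refers to, here is a minimal sketch of a classical Pólya tree with a fixed dyadic partition of [0, 1). It is not the hidden Markov model or the SMC algorithm of the paper. The function name `polya_tree_density`, the parameters `depth` and `c`, and the symmetric prior Beta(c·k², c·k²) at level k are assumptions chosen for illustration.

```python
import numpy as np


def polya_tree_density(data, x, depth=6, c=1.0):
    """Posterior-mean density at x for a dyadic Polya tree on [0, 1).

    Illustrative assumptions: fixed midpoint splits, symmetric
    Beta(c * level**2, c * level**2) prior on each left-branch probability.
    """
    data = np.asarray(data, dtype=float)
    lo, hi = 0.0, 1.0
    density = 1.0
    for level in range(1, depth + 1):
        mid = 0.5 * (lo + hi)
        # Observations that fall in the current node [lo, hi).
        in_node = data[(data >= lo) & (data < hi)]
        n_left = int(np.sum(in_node < mid))
        n_right = in_node.size - n_left
        alpha = c * level ** 2
        # Conjugate update: the left-branch probability has posterior
        # Beta(alpha + n_left, alpha + n_right); use its mean.
        p_left = (alpha + n_left) / (2.0 * alpha + n_left + n_right)
        # Each split halves the cell width, hence the factor 2.
        if x < mid:
            density *= 2.0 * p_left
            hi = mid
        else:
            density *= 2.0 * (1.0 - p_left)
            lo = mid
    return density


if __name__ == "__main__":
    sample = [0.05, 0.1, 0.12, 0.2, 0.22]
    print(polya_tree_density(sample, 0.1), polya_tree_density(sample, 0.9))
```

With no data the sketch returns the uniform density, and with data it raises the estimate in cells that contain observations. Because the midpoint splits are fixed in advance, the estimate depends on where the cell boundaries happen to fall, which is the sensitivity to the partition tree that the abstract names as a limitation.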
