Paper title

Fast and fully-automated histograms for large-scale data sets

Authors

Valentina Zelaya Mendizábal, Marc Boullé, Fabrice Rossi

Abstract

G-Enum histograms are a new, fast, and fully automated method for irregular histogram construction. By framing histogram construction as a density estimation problem and its automation as a model selection task, these histograms leverage the Minimum Description Length (MDL) principle to derive two different model selection criteria. Several proven theoretical results about these criteria give insight into their asymptotic behavior and are used to speed up their optimisation. These insights, combined with a greedy search heuristic, are used to construct histograms in linearithmic time rather than the polynomial time incurred by previous work. The capabilities of the proposed MDL density estimation method are illustrated by comparison with other fully automated methods in the literature, on both synthetic and large real-world data sets.
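To make the idea of MDL-driven histogram selection concrete, here is a minimal sketch in Python. It is not the paper's algorithm or its exact criteria: `mdl_cost` is a simplified enumerative code length written for illustration, and `greedy_histogram` is a naive bottom-up merge that rescans every adjacent pair at each step, so it is far slower than the linearithmic search the abstract describes. The function names, the `n_elem` parameter (number of equal-width elementary bins), and the choice of code-length terms are all assumptions.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def mdl_cost(counts, widths, n_elem):
    """Simplified two-part code length (in nats) of an irregular histogram.

    counts[k] is the number of points in interval k, widths[k] the number
    of elementary bins it spans. Illustrative only; assumed, not the
    paper's exact criterion.
    """
    n, k = sum(counts), len(counts)
    # Model part: how the elementary bins are split into k intervals
    # (empty parts allowed, so this is a loose upper bound).
    cost = log_binom(n_elem + k - 1, k - 1)
    # Model part: how the n points are distributed over the k intervals.
    cost += log_binom(n + k - 1, k - 1)
    # Data part: which points fall in which interval (multinomial term).
    cost += math.lgamma(n + 1) - sum(math.lgamma(h + 1) for h in counts)
    # Data part: elementary bin of each point inside its interval.
    cost += sum(h * math.log(w) for h, w in zip(counts, widths))
    return cost


def greedy_histogram(data, n_elem=64):
    """Greedy bottom-up merge of adjacent intervals while the cost drops."""
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    counts = [0] * n_elem
    for x in data:
        idx = min(int((x - lo) / span * n_elem), n_elem - 1)
        counts[idx] += 1
    widths = [1] * n_elem
    best = mdl_cost(counts, widths, n_elem)

    while len(counts) > 1:
        candidates = []
        for i in range(len(counts) - 1):
            c = counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]
            w = widths[:i] + [widths[i] + widths[i + 1]] + widths[i + 2:]
            candidates.append((mdl_cost(c, w, n_elem), c, w))
        cost, c, w = min(candidates, key=lambda t: t[0])
        if cost >= best:  # no single merge shortens the description
            break
        best, counts, widths = cost, c, w

    # Convert widths (in elementary bins) back to interval edges.
    edges, acc = [lo], 0
    for w in widths:
        acc += w
        edges.append(lo + span * acc / n_elem)
    return edges, counts
```

Calling `greedy_histogram(data, n_elem=32)` returns the interval edges and the per-interval counts. A practical implementation would keep candidate merges in a priority queue and update the cost incrementally instead of recomputing it from scratch for every candidate.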
