论文标题
如何计算三角形,而无需看到整个图
How to Count Triangles, without Seeing the Whole Graph
论文作者
论文摘要
三角计数是大图分析中的一个基本问题。在不同的流和分布式模型中,关于这个问题有很多工作,但是所有这些算法都需要阅读整个输入图。在许多情况下,我们无法访问整个图表,只能采样图表的一小部分(通常是通过爬行)。在这种情况下,我们如何准确估计图的三角计数? 我们正式研究了Dasgupta等人(www '14)和Chierichetti等人(www '16)引入的{\ em随机步行}访问模型的三角计数。我们可以访问该图的任意种子顶点,并且只能进行随机步行。该模型在访问方面具有限制性,并捕捉了收集现实图形图的挑战。在此模型中,即使对统一的随机顶点进行采样也是一项艰巨的任务。 尽管面临这些挑战,但我们设计了一种可证明且实用的算法,即俄罗斯方块,用于该模型中的三角形计数。俄罗斯方块是第一种可证明的sublinear算法(对于大多数自然参数设置),该算法近似于随机步行模型中的三角计数,用于较低的混合时间的图。我们的结果是基于标准算法理论的最新进展。俄罗斯方块制造的最终样本是仔细的随机步行和邻里偏向抽样的混合。从经验上讲,俄罗斯方块在各种大图上精确计数三角形,通过查看边缘数的3%来获得5 \%相对误差的估计。
Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph? We formally study triangle counting in the {\em random walk} access model introduced by Dasgupta et al (WWW '14) and Chierichetti et al (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting real-world graphs. Even sampling a uniform random vertex is a hard task in this model. Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degree-biased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5\% relative error by looking at 3\% of the number of edges.