A Framework for Large Scale Synthetic Graph Dataset Generation

Paper Title

A Framework for Large Scale Synthetic Graph Dataset Generation

Authors

Sajad Darabi, Piotr Bigaj, Dawid Majchrowski, Artur Kasymov, Pawel Morkisz, Alex Fit-Florea

Abstract

Recently there has been increasing interest in developing and deploying deep graph learning algorithms for many tasks, such as fraud detection and recommender systems. However, only a limited number of graph-structured datasets are publicly available, and most of them are tiny compared to production-sized applications or are limited in their application domain. This work tackles this shortcoming by proposing a scalable synthetic graph generation tool that scales datasets to production-size graphs with trillions of edges and billions of nodes. The tool learns a series of parametric models from proprietary datasets; these models can be released to researchers to study various graph methods on the synthetic data, supporting prototype development and novel applications. We demonstrate the generalizability of the framework across a series of datasets, mimicking their structural and feature distributions, and show that the datasets can be scaled to varying sizes, which makes them useful for benchmarking and model development. Code can be found at https://github.com/NVIDIA/DeepLearningExamples/tree/master/Tools/DGLPyTorch/SyntheticGraphGeneration.
