Paper Title

Accurate, Efficient and Scalable Training of Graph Neural Networks

Authors

Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna

Abstract

Graph Neural Networks (GNNs) are powerful deep learning models to generate node embeddings on graphs. When applying deep GNNs on large graphs, it is still challenging to perform training in an efficient and scalable way. We propose a novel parallel training framework. Through sampling small subgraphs as minibatches, we reduce training workload by orders of magnitude compared with state-of-the-art minibatch methods. We then parallelize the key computation steps on tightly-coupled shared memory systems. For graph sampling, we exploit parallelism within and across sampler instances, and propose an efficient data structure supporting concurrent accesses from samplers. The parallel sampler theoretically achieves near-linear speedup with respect to the number of processing units. For feature propagation within subgraphs, we improve cache utilization and reduce DRAM traffic by data partitioning. Our partitioning is a 2-approximation strategy for minimizing the communication cost compared to the optimal. We further develop a runtime scheduler to reorder the training operations and adjust the minibatch subgraphs to improve parallel performance. Finally, we generalize the above parallelization strategies to support multiple types of GNN models and graph samplers. The proposed training outperforms the state-of-the-art in scalability, efficiency and accuracy simultaneously. On a 40-core Xeon platform, we achieve 60x speedup (with AVX) in the sampling step and 20x speedup in the feature propagation step, compared to the serial implementation. Our algorithm enables fast training of deeper GNNs, as demonstrated by orders of magnitude speedup compared to the TensorFlow implementation. We open-source our code at https://github.com/GraphSAINT/GraphSAINT.
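To make the subgraph-minibatch idea in the abstract concrete, below is a minimal, hypothetical sketch of the training pattern it describes: sample a small subgraph, then run GNN feature propagation only within that subgraph. The sampler (`sample_node_subgraph`), the single mean-aggregation layer, and all sizes are illustrative assumptions for exposition; they are not the authors' parallel GraphSAINT implementation (see the GitHub link above for the actual code).

```python
# Illustrative sketch only: subgraph-as-minibatch GNN training.
# Assumed/hypothetical pieces: the uniform node sampler, the toy random graph,
# and the single mean-aggregation layer. Not the authors' implementation.
import numpy as np

def sample_node_subgraph(adj, num_nodes, rng):
    """Uniform node sampler: pick a random node set and induce its subgraph."""
    nodes = rng.choice(adj.shape[0], size=num_nodes, replace=False)
    sub_adj = adj[np.ix_(nodes, nodes)]          # adjacency of the induced subgraph
    return nodes, sub_adj

def propagate(sub_adj, feats, weight):
    """One GNN layer restricted to the subgraph: mean-aggregate, transform, ReLU."""
    deg = sub_adj.sum(axis=1, keepdims=True) + 1e-9
    agg = (sub_adj @ feats) / deg                # neighbor aggregation within subgraph
    return np.maximum(agg @ weight, 0.0)

rng = np.random.default_rng(0)
n, f_in, f_out = 1000, 64, 32
adj = (rng.random((n, n)) < 0.01).astype(np.float32)       # toy random graph
feats = rng.standard_normal((n, f_in)).astype(np.float32)  # node features
weight = rng.standard_normal((f_in, f_out)).astype(np.float32)

for step in range(3):
    # Each minibatch is a sampled subgraph, so propagation touches only 128 nodes
    # instead of the full graph, which is the source of the workload reduction.
    nodes, sub_adj = sample_node_subgraph(adj, num_nodes=128, rng=rng)
    emb = propagate(sub_adj, feats[nodes], weight)
    print(step, emb.shape)                       # (128, 32) embeddings per minibatch
```

In the paper's framework, the sampling step and the in-subgraph feature propagation shown here are the two stages that get parallelized (across sampler instances, and via cache-aware feature partitioning, respectively); this sketch only shows the serial dataflow they operate on.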
