Paper Title

Characterizing and Understanding Distributed GNN Training on GPUs

Authors

Haiyang Lin, Mingyu Yan, Xiaocheng Yang, Mo Zou, Wenming Li, Xiaochun Ye, Dongrui Fan

Abstract

Graph neural networks (GNNs) have been demonstrated to be powerful models in many domains for their effectiveness in learning over graphs. To scale GNN training to large graphs, a widely adopted approach is distributed training, which accelerates training using multiple computing nodes. Maximizing performance is essential, but the execution of distributed GNN training remains only preliminarily understood. In this work, we provide an in-depth analysis of distributed GNN training on GPUs, revealing several significant observations and providing useful guidelines for both software and hardware optimization.
