Paper Title
DistGNN-MB: Distributed Large-Scale Graph Neural Network Training on x86 via Minibatch Sampling
Paper Authors
Paper Abstract
Training Graph Neural Networks, on graphs containing billions of vertices and edges, at scale using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume and potential performance loss. DistGNN-MB employs a novel Historical Embedding Cache combined with compute-communication overlap to address this challenge. On a 32-node (64-socket) cluster of $3^{rd}$ generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively, on 32 compute nodes. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely-used DistDGL. DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively, as compute nodes scale from 2 to 32.
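To illustrate the idea behind a historical embedding cache, the following is a minimal sketch, not the paper's implementation: remote (halo) vertex embeddings computed in earlier iterations are reused as long as they are fresh enough, so only cache misses require communication with other partitions. The class name, methods, and the `staleness_limit` parameter are hypothetical.

```python
import torch


class HistoricalEmbeddingCache:
    """Sketch of a cache for embeddings of remote (halo) vertices.

    Entries older than `staleness_limit` iterations are treated as misses,
    so stale embeddings are eventually refreshed via communication rather
    than reused indefinitely. Hypothetical API, for illustration only.
    """

    def __init__(self, dim, staleness_limit=10):
        self.dim = dim
        self.staleness_limit = staleness_limit
        self.embeddings = {}   # vertex id -> cached embedding tensor
        self.last_update = {}  # vertex id -> iteration when it was cached

    def put(self, vertex_ids, embeddings, iteration):
        """Store embeddings received for remote vertices this iteration."""
        for vid, emb in zip(vertex_ids.tolist(), embeddings):
            self.embeddings[vid] = emb.detach()
            self.last_update[vid] = iteration

    def split(self, vertex_ids, iteration):
        """Split requested vertices into fresh cache hits and misses."""
        hits, misses = [], []
        for vid in vertex_ids.tolist():
            fresh = (vid in self.embeddings and
                     iteration - self.last_update[vid] <= self.staleness_limit)
            (hits if fresh else misses).append(vid)
        return hits, misses

    def gather(self, vertex_ids):
        """Return cached embeddings for the given hit vertices."""
        return torch.stack([self.embeddings[v] for v in vertex_ids])


# Usage sketch: hits reuse cached embeddings with no communication; misses
# would be fetched from remote partitions, with that communication overlapped
# with local computation.
cache = HistoricalEmbeddingCache(dim=128, staleness_limit=5)
remote_ids = torch.tensor([3, 7, 42])
cache.put(remote_ids, torch.randn(3, 128), iteration=0)
hits, misses = cache.split(torch.tensor([3, 7, 99]), iteration=2)
cached = cache.gather(hits)
```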