GC3: An Optimizing Compiler for GPU Collective Communication

Paper title

GC3: An Optimizing Compiler for GPU Collective Communication

Authors

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, Yifan Xiong

Abstract

Machine learning models made up of millions or billions of parameters are trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communications used in these applications become a bottleneck. Custom collective algorithms optimized for both particular network topologies and application-specific communication patterns can alleviate this bottleneck and help these applications scale. However, correctly and efficiently implementing custom algorithms is challenging. This paper introduces GC3, a system for programmable GPU communication. GC3 provides a domain-specific language for writing collective communication algorithms and an optimizing compiler for lowering them to an executable form, which can be executed efficiently and flexibly in an interpreter-based runtime. We used GC3 to write novel collective algorithms for AllReduce and AllToAll that are up to $1.9\times$ and $1.3\times$ faster than hand-optimized implementations, respectively.
