Paper Title

ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs

Paper Authors

Guyue Huang, Yang Bai, Liu Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie

Paper Abstract

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. To unleash the high performance of the latest GPUs, we must perform a synergistic optimization of multi-stage pipelining across the GPU's multi-level buffer hierarchy. Existing frameworks rely on hand-written libraries such as cuBLAS to perform pipelining optimization, which is inextensible to new operators and un-composable with prior tensor compiler optimizations. This paper presents ALCOP, the first framework that is compiler-native and fully supports multi-stage multi-level pipelining. ALCOP overcomes three critical obstacles in generating pipelined code: detection of pipelining-applicable buffers, program transformation for multi-level multi-stage pipelining, and efficient schedule parameter search by incorporating static analysis. Experiments show that ALCOP can generate programs with a 1.23x average speedup (up to 1.73x) over vanilla TVM. On end-to-end models, ALCOP improves upon TVM by up to 1.18x and upon XLA by up to 1.64x. In addition, our performance model significantly improves the efficiency of the schedule tuning process and can find schedules with 99% of the performance of exhaustive search while requiring 40x fewer trials.
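To make the optimization concrete, below is a minimal CUDA sketch of the simplest form of this technique: a two-stage, single-level pipeline implemented as shared-memory double buffering, where the load of the next tile overlaps with the compute on the current tile. The kernel name, tile size, and toy neighbor-sum computation are illustrative assumptions, not ALCOP output; the paper's contribution is generating the multi-stage, multi-level version of this pattern (including asynchronous copies and the shared-memory-to-register level) automatically inside the compiler.

```cuda
// Minimal sketch of two-stage load-compute pipelining via shared-memory
// double buffering. Illustrative only: real pipelined code of the kind
// ALCOP targets would also use asynchronous copies (e.g. Ampere cp.async)
// and pipeline additional buffer levels.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // one element per thread per tile (assumed size)

__global__ void pipelined_kernel(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int num_tiles, float alpha) {
    __shared__ float buf[2][TILE];  // two buffers: one loading, one computing
    int t = threadIdx.x;

    // Prologue: fill buffer 0 with this block's first tile.
    int first = blockIdx.x;
    if (first < num_tiles) buf[0][t] = in[first * TILE + t];
    __syncthreads();

    int stage = 0;
    for (int tile = first; tile < num_tiles; tile += gridDim.x) {
        // Prefetch the next tile into the idle buffer; its load latency
        // overlaps with the compute on the current buffer below.
        int next = tile + gridDim.x;
        if (next < num_tiles) buf[stage ^ 1][t] = in[next * TILE + t];

        // Compute on the current buffer (a toy neighbor sum, so staging
        // through shared memory is actually meaningful).
        out[tile * TILE + t] =
            alpha * (buf[stage][t] + buf[stage][(t + 1) % TILE]);

        stage ^= 1;
        __syncthreads();  // next buffer must be fully written before reuse
    }
}

int main() {
    const int num_tiles = 1024, n = num_tiles * TILE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    pipelined_kernel<<<64, TILE>>>(in, out, num_tiles, 0.5f);
    cudaDeviceSynchronize();

    printf("out[0] = %f (expect 1.0)\n", out[0]);  // 0.5 * (1 + 1)
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The `__syncthreads()` at the end of each iteration is what makes the double buffer safe: it guarantees the prefetched buffer is fully written before it becomes the compute buffer, and that all reads of the old compute buffer finish before the next iteration overwrites it.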
