Title
Learning from distinctive candidates to optimize reduced-precision convolution program on tensor cores
Authors
Abstract
Convolution is one of the fundamental operations of deep neural networks, demanding heavy matrix computation. In a graphics processing unit (GPU), Tensor Core is specialized matrix processing hardware equipped with reduced-precision matrix-multiply-accumulate (MMA) instructions to increase throughput. However, it is challenging to achieve optimal performance, since the best scheduling of MMA instructions varies across convolution sizes. In particular, reduced-precision MMA requires many elements to be grouped into a matrix operand, which seriously limits data reuse and imposes packing and layout overhead on the schedule. This work proposes an automatic scheduling method of reduced-precision MMA for the convolution operation. In this method, we devise a search space that explores thread tile and warp sizes to increase data reuse despite the large matrix operands of reduced-precision MMA. The search space also includes options for register-level packing and layout optimization to lessen the overhead of handling reduced-precision data. Finally, we propose a search algorithm that finds the best schedule by learning from distinctive candidates. This reduced-precision MMA optimization method is evaluated on convolution operations of popular neural networks, demonstrating substantial speedup on Tensor Cores over the state of the art with shortened search time.
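To make the scheduling problem in the abstract concrete, the sketch below enumerates a hypothetical search space of thread-tile and warp sizes and picks the candidate minimizing a stand-in cost model. This is an illustration only, not the paper's algorithm: the tile/warp values and the `cost` function are invented assumptions, and the paper's method additionally learns from distinctive candidates to prune such a search rather than evaluating it exhaustively.

```python
from itertools import product

# Hypothetical search space of schedule knobs (illustrative values only;
# real Tensor Core schedules are constrained by the hardware MMA shapes).
TILE_SIZES = [32, 64, 128]
WARP_COUNTS = [1, 2, 4]

def cost(tile, warps, conv_m=256, conv_n=256):
    """Stand-in cost model (an assumption for illustration): small tiles
    reduce data reuse, while more warps per tile add packing/layout work."""
    reuse_penalty = (conv_m * conv_n) / (tile * tile)
    packing_penalty = 0.5 * warps
    return reuse_penalty + packing_penalty

def search_best_schedule():
    # Exhaustive evaluation of all (tile, warps) candidates; a learned
    # search would instead sample and rank only promising candidates.
    return min(product(TILE_SIZES, WARP_COUNTS), key=lambda s: cost(*s))

if __name__ == "__main__":
    print(search_best_schedule())  # best (tile, warps) under the toy model
```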