Paper Title
Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance
Paper Authors
Paper Abstract
Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on the Ampere architecture. Tensor Cores were developed in response to the high demand for dense matrix multiplication from machine learning, but many applications in scientific computing, such as preconditioners for iterative solvers and low-precision Fourier transforms, can also exploit them. To compute a matrix multiplication on Tensor Cores, the input matrices must be converted to half precision, which results in a loss of accuracy. To avoid this, the mantissa bits lost in the conversion can be kept in additional half-precision variables and used to correct the result of the matrix-matrix multiplication. Even with this correction, using Tensor Cores yields higher throughput than FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high-accuracy, high-performance, and low-power-consumption matrix-matrix multiplication implementation using Tensor Cores that exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA's CUTLASS. We found that the keys to achieving this accuracy are how the rounding inside Tensor Cores and the probability of underflow during the correction computation are handled. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for the full exponent range of FP32 using TF32 Tensor Cores on an NVIDIA A100 GPU, both of which exceed the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.
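As an illustration of the split-and-correct idea the abstract describes, the minimal CUDA sketch below splits each FP32 input into an FP16 value plus an FP16 residual holding the lost mantissa bits, and accumulates three Tensor Core (WMMA) products in FP32. All names and the 16x16x16 tile size are illustrative assumptions, not the paper's code: the actual implementation is built on CUTLASS and additionally controls the rounding inside Tensor Cores and the underflow of the residual terms, which this sketch does not.

// correction_sketch.cu -- illustrative only, not the paper's CUTLASS-based kernel.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Split an FP32 value into an FP16 "high" part and an FP16 residual that keeps
// the mantissa bits lost in the conversion. (Scaling the residual to reduce its
// underflow probability, as discussed in the paper, is omitted here.)
__device__ void split_fp32(float x, half &hi, half &lo) {
    hi = __float2half(x);
    lo = __float2half(x - __half2float(hi));
}

// One warp multiplies a single 16x16x16 tile:
//   A*B ≈ Ahi*Bhi + Ahi*Blo + Alo*Bhi   (the Alo*Blo term is dropped as negligible)
// a_hi/a_lo and b_hi/b_lo are assumed to have been produced elementwise with
// split_fp32; A is row-major, B is column-major, both with leading dimension 16.
__global__ void corrected_mma_16x16x16(const half *a_hi, const half *a_lo,
                                        const half *b_hi, const half *b_lo,
                                        float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa_hi, fa_lo;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb_hi, fb_lo;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(fa_hi, a_hi, 16);
    wmma::load_matrix_sync(fa_lo, a_lo, 16);
    wmma::load_matrix_sync(fb_hi, b_hi, 16);
    wmma::load_matrix_sync(fb_lo, b_lo, 16);

    wmma::mma_sync(acc, fa_hi, fb_hi, acc);  // main FP16 product, FP32 accumulation
    wmma::mma_sync(acc, fa_hi, fb_lo, acc);  // correction term 1
    wmma::mma_sync(acc, fa_lo, fb_hi, acc);  // correction term 2

    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}

Launched with one warp (e.g. <<<1, 32>>>), this recovers most of the accuracy lost in the FP16 conversion; matching FP32 SIMT accuracy exactly, as the paper claims, requires the additional rounding and underflow handling mentioned above.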