Optimizing Irregular-Shaped Matrix-Matrix Multiplication on Multi-Core DSPs

Paper Title

Optimizing Irregular-Shaped Matrix-Matrix Multiplication on Multi-Core DSPs

Authors

Yin, Shangfei; Wang, Qinglin; Hao, Ruochen; Zhou, Tianyang; Mei, Songzhu; Liu, Jie

Abstract

General matrix multiplication (GEMM) has a wide range of applications in scientific simulation and artificial intelligence. Although traditional libraries achieve high performance on large regular-shaped GEMMs, they often perform poorly on irregular-shaped GEMMs, which frequently arise in new algorithms and applications in high-performance computing (HPC). Due to energy-efficiency constraints, low-power multi-core digital signal processors (DSPs) have become an alternative architecture in HPC systems. Targeting the multi-core DSPs in FT-m7032, a prototype CPU-DSP heterogeneous processor for HPC, an efficient implementation, ftIMM, is proposed for three types of irregular-shaped GEMMs. ftIMM supports automatic generation of assembly micro-kernels, two parallelization strategies, and auto-tuning of block sizes and parallelization strategies. Experiments show that ftIMM outperforms traditional GEMM implementations on the multi-core DSPs in FT-m7032, achieving up to 7.2x higher performance on irregular-shaped GEMMs. ftIMM on the multi-core DSPs also far outperforms the open-source library on the multi-core CPUs in FT-m7032, delivering up to 3.1x higher efficiency.
