Paper Title

Optimizing Irregular-Shaped Matrix-Matrix Multiplication on Multi-Core DSPs

Authors

Shangfei Yin, Qinglin Wang, Ruochen Hao, Tianyang Zhou, Songzhu Mei, Jie Liu

Abstract

General matrix multiplication (GEMM) is widely used in scientific simulation and artificial intelligence. Although traditional libraries achieve high performance on large, regular-shaped GEMMs, they often perform poorly on irregular-shaped GEMMs, which are common in new algorithms and applications of high-performance computing (HPC). Due to energy-efficiency constraints, low-power multi-core digital signal processors (DSPs) have become an alternative architecture in HPC systems. Targeting the multi-core DSPs in FT-m7032, a prototype CPU-DSP heterogeneous processor for HPC, an efficient implementation named ftIMM is proposed for three types of irregular-shaped GEMMs. ftIMM supports automatic generation of assembly micro-kernels, two parallelization strategies, and auto-tuning of block sizes and parallelization strategies. Experiments show that, on irregular-shaped GEMMs, ftIMM outperforms the traditional GEMM implementations on the multi-core DSPs in FT-m7032 by up to 7.2x. ftIMM on the multi-core DSPs also far outperforms the open-source library on the multi-core CPUs in FT-m7032, delivering up to 3.1x higher efficiency.
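To make the role of block sizes concrete, the following is a minimal sketch of a blocked (tiled) GEMM in plain Python. It is not the ftIMM implementation: the function names `blocked_gemm` and `naive_gemm`, the parameters `m_blk`, `n_blk`, and `k_blk`, and the example shape are all hypothetical and chosen only for illustration. The innermost loops stand in for what a hand-tuned micro-kernel would do on real hardware, and the block sizes are the kind of parameters an auto-tuner might search over.

```python
def naive_gemm(a, b):
    """Reference C = A * B for lists of lists; A is M x K, B is K x N."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def blocked_gemm(a, b, m_blk=4, n_blk=4, k_blk=4):
    """C = A * B using loop tiling with (hypothetical) block sizes."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    c = [[0] * n for _ in range(m)]
    # Outer loops walk over blocks of C and of the shared dimension K.
    for i0 in range(0, m, m_blk):
        for j0 in range(0, n, n_blk):
            for p0 in range(0, k, k_blk):
                # Inner loops: stand-in for a micro-kernel on one block.
                # min() clips the block at the matrix edge, which matters
                # when a dimension is smaller than the block size.
                for i in range(i0, min(i0 + m_blk, m)):
                    for j in range(j0, min(j0 + n_blk, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + k_blk, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


# Example of an irregular shape (assumed for illustration): M is much
# larger than K and N, so a square block would be mostly clipped.
M, K, N = 64, 3, 5
A = [[i + p for p in range(K)] for i in range(M)]
B = [[p - j for j in range(N)] for p in range(K)]
C = blocked_gemm(A, B, m_blk=8, n_blk=2, k_blk=2)
```

With a shape like this, block sizes tuned for large square matrices leave most of each block empty in the small dimensions, which is one plausible reason a shape-aware choice of block sizes can help.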
