SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

Paper Title

SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

Authors

Christina Giannoula, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu

Abstract

Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth, and low memory access latency, which makes them a good fit for accelerating the widely used, memory-bound Sparse Matrix Vector Multiplication (SpMV) kernel. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies for SpMV on a multithreaded PIM core and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels that perform the complete SpMV computation using only PIM cores, and (2) 2D-partitioned kernels that strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to state-of-the-art CPU and GPU systems to study the performance and energy efficiency of various devices. The SparseP software package provides 25 SpMV kernels for real PIM systems, supporting the four most widely used compressed matrix formats and a wide range of data types. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate SpMV on real PIM systems.
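To make the idea of a 1D-partitioned SpMV kernel more concrete, the following is a minimal host-side sketch in plain Python. It is not SparseP's code and does not run on PIM hardware. It assumes the CSR (compressed sparse row) format as one example of a compressed matrix format, and it assumes a simple load balancing scheme that cuts the rows into contiguous blocks with roughly equal numbers of nonzeros. The function names (`spmv_csr`, `partition_rows_by_nnz`, `spmv_1d`) are hypothetical and chosen only for illustration; each row block stands in for the work one PIM core might receive.

```python
from bisect import bisect_left


def spmv_csr(row_ptr, col_idx, values, x, row_start=0, row_end=None):
    """Return y[row_start:row_end] for y = A @ x, with A stored in CSR form."""
    if row_end is None:
        row_end = len(row_ptr) - 1
    y = []
    for i in range(row_start, row_end):
        acc = 0
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def partition_rows_by_nnz(row_ptr, num_parts):
    """Split rows into contiguous blocks with roughly equal nonzero counts.

    Returns num_parts + 1 row boundaries; block p covers rows
    bounds[p] .. bounds[p + 1] - 1. Blocks may be empty for tiny matrices.
    """
    n_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, num_parts):
        target = nnz * p / num_parts
        # row_ptr is a prefix sum of per-row nonzero counts, so a binary
        # search finds the first row boundary that reaches the target.
        cut = bisect_left(row_ptr, target)
        cut = min(max(cut, bounds[-1]), n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return bounds


def spmv_1d(row_ptr, col_idx, values, x, num_parts):
    """1D (row-wise) partitioned SpMV: every block sees the whole vector x."""
    bounds = partition_rows_by_nnz(row_ptr, num_parts)
    y = []
    for p in range(num_parts):
        # In a PIM setting each block would be handled by a separate core;
        # here the blocks are simply processed one after another.
        y.extend(spmv_csr(row_ptr, col_idx, values, x, bounds[p], bounds[p + 1]))
    return y


# Example: A = [[1, 0, 2], [0, 0, 3], [4, 5, 6]] in CSR form.
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 2, 0, 1, 2]
values = [1, 2, 3, 4, 5, 6]
x = [1, 2, 3]
y = spmv_1d(row_ptr, col_idx, values, x, num_parts=2)
```

In this sketch every block needs the full input vector, which hints at why a 1D scheme can pay a high cost for moving the vector to each core. A 2D scheme would also split the columns, so each block needs only a slice of the vector but produces partial results that must be merged afterwards.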
