Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Paper title

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Authors

Matheus Cavalcante, Domenic Wüthrich, Matteo Perotti, Samuel Riedel, Luca Benini

Abstract

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include micro-architectural tricks to improve Instruction Level Parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256x256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
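
The relative figures in the abstract can be turned into rough baseline numbers with a few lines of arithmetic. The sketch below is only a reader's back-of-the-envelope check, not something taken from the paper: the function and variable names are hypothetical, and because "40% less" and "70% more" are rounded in the abstract, the derived Snitch values are approximations rather than reported results.

```python
# Figures quoted directly in the abstract.
SPATZ_PJ_PER_MAC = 7.9        # pJ per 32-bit integer MAC, four-MACU cluster
ENERGY_SAVING = 0.40          # "40% less energy" than the Snitch cluster
SPATZ_GOPS = 285.0            # peak throughput, 256x256 matmul on MemPool
THROUGHPUT_GAIN = 0.70        # "70% more" than the Snitch-based MemPool
SPATZ_GOPS_PER_W = 266.0      # peak energy efficiency, Spatz-based MemPool
SNITCH_GOPS_PER_W = 128.0     # peak energy efficiency, Snitch-based MemPool


def baseline_from_saving(value, saving):
    """Baseline x such that value = x * (1 - saving)."""
    return value / (1.0 - saving)


def baseline_from_gain(value, gain):
    """Baseline x such that value = x * (1 + gain)."""
    return value / (1.0 + gain)


# Implied (approximate) Snitch figures; assumptions, not values from the paper.
snitch_pj_per_mac = baseline_from_saving(SPATZ_PJ_PER_MAC, ENERGY_SAVING)
snitch_gops = baseline_from_gain(SPATZ_GOPS, THROUGHPUT_GAIN)

# Ratio of the two efficiency figures that the abstract does state.
efficiency_ratio = SPATZ_GOPS_PER_W / SNITCH_GOPS_PER_W

print(f"implied Snitch energy per MAC: {snitch_pj_per_mac:.1f} pJ")
print(f"implied Snitch throughput:     {snitch_gops:.0f} GOPS")
print(f"efficiency ratio:              {efficiency_ratio:.2f}x")
```

The efficiency ratio comes out slightly above 2, which matches the abstract's "more than twice" claim; the other two values depend on the rounding of the quoted percentages and should be read as estimates only.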
