FPnew: An Open-Source Multi-Format Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

Paper Title

FPnew: An Open-Source Multi-Format Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

Authors

Stefan Mach, Fabian Schuiki, Florian Zaruba, Luca Benini

Abstract

The slowdown of Moore's law and the power wall necessitate a shift towards finely tunable precision (a.k.a. transprecision) computing to reduce energy footprint. Hence, we need circuits capable of performing floating-point operations on a wide range of precisions with high energy-proportionality. We present FPnew, a highly configurable open-source transprecision floating-point unit (TP-FPU) capable of supporting a wide range of standard and custom FP formats. To demonstrate the flexibility and efficiency of FPnew in general-purpose processor architectures, we extend the RISC-V ISA with operations on half-precision, bfloat16, and an 8-bit FP format, as well as SIMD vectors and multi-format operations. Integrated into a 32-bit RISC-V core, our TP-FPU can speed up execution of mixed-precision applications by 1.67x w.r.t. an FP32 baseline, while maintaining end-to-end precision and reducing system energy by 37%. We also integrate FPnew into a 64-bit RISC-V core, supporting five FP formats on scalars or 2, 4, or 8-way SIMD vectors. For this core, we measured the silicon manufactured in Globalfoundries 22FDX technology across a wide voltage range from 0.45V to 1.2V. The unit achieves leading-edge measured energy efficiencies between 178 Gflop/sW (on FP64) and 2.95 Tflop/sW (on 8-bit mini-floats), and a performance between 3.2 Gflop/s and 25.3 Gflop/s.
