A transprecision floating-point cluster for efficient near-sensor data analytics

Paper title

A transprecision floating-point cluster for efficient near-sensor data analytics

Authors

Fabio Montagna, Stefan Mach, Simone Benatti, Angelo Garofalo, Gianmarco Ottavi, Luca Benini, Davide Rossi, Giuseppe Tagliavini

Abstract

Recent applications in the domain of near-sensor computing require the adoption of floating-point arithmetic to reconcile high-precision results with a wide dynamic range. In this paper, we propose a multi-core computing cluster that leverages the fine-grained tunable principles of transprecision computing to support near-sensor applications at a minimum power budget. Our design, based on the open-source RISC-V architecture, combines parallelization and sub-word vectorization with near-threshold operation, leading to a highly scalable and versatile system. We perform an exhaustive exploration of the design space of the transprecision cluster on a cycle-accurate FPGA emulator, aiming to identify the most efficient configurations in terms of performance, energy efficiency, and area efficiency. We also provide full-fledged software stack support, including a parallel runtime and a compilation toolchain, to enable the development of end-to-end applications. We perform an experimental assessment of our design on a set of benchmarks representative of the near-sensor processing domain, complementing the timing results with a post-place-and-route analysis of the power consumption. Finally, a comparison with the state of the art shows that our solution outperforms the competitors in energy efficiency, reaching a peak of 97 Gflop/s/W on single-precision scalars and 162 Gflop/s/W on half-precision vectors.
