Improving Memory Utilization in Convolutional Neural Network Accelerators

Paper Title

Improving Memory Utilization in Convolutional Neural Network Accelerators

Authors

Petar Jokic, Stephane Emery, Luca Benini

Abstract

While the accuracy of convolutional neural networks has improved vastly through larger and deeper network architectures, the memory footprint for storing their parameters and activations has also increased. This trend especially challenges power- and resource-limited accelerator designs, which are often restricted to storing all network data in on-chip memory to avoid interfacing with energy-hungry external memories. Maximizing the network size that fits on a given accelerator thus requires maximizing its memory utilization. While the traditionally used ping-pong buffering technique maps subsequent activation layers to disjoint memory regions, we propose a mapping method that allows these regions to overlap and thus uses the memory more efficiently. This work presents the mathematical model to compute the maximum activation memory overlap and thus the lower bound of on-chip memory needed to perform layer-by-layer processing of convolutional neural networks on memory-limited accelerators. Our experiments with various real-world object detector networks show that the proposed mapping technique can decrease the activation memory by up to 32.9%, reducing the overall memory for the entire network by up to 23.9% compared to traditional ping-pong buffering. For higher-resolution de-noising networks, we achieve activation memory savings of 48.8%. Additionally, we implement a face detector network on an FPGA-based camera to validate these memory savings on a complete end-to-end system.
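The difference between disjoint ping-pong regions and overlapping regions can be illustrated with a minimal sketch. It is not the paper's mathematical model: the permitted overlap per layer transition is taken here as a given input, whereas the paper derives its maximum. The function names, layer sizes, and overlap values are all hypothetical and chosen only for illustration.

```python
def ping_pong_memory(sizes):
    """Activation memory with two disjoint buffers.

    Assumption: layer i's activations live in buffer i % 2, so each
    buffer must hold the largest activation map ever assigned to it.
    """
    even = max(sizes[0::2], default=0)
    odd = max(sizes[1::2], default=0)
    return even + odd


def overlapped_memory(sizes, overlaps):
    """Activation memory when input and output regions may overlap.

    overlaps[i] is the (hypothetical, externally supplied) number of
    elements that layer i's input and output regions are allowed to share.
    """
    if len(overlaps) != max(len(sizes) - 1, 0):
        raise ValueError("need one overlap value per layer transition")
    worst = max(sizes, default=0)
    for inp, out, ov in zip(sizes, sizes[1:], overlaps):
        if not 0 <= ov <= min(inp, out):
            raise ValueError("overlap must lie between 0 and the smaller region")
        # Both regions must coexist during the transition, minus the shared part.
        worst = max(worst, inp + out - ov)
    return worst


# Made-up activation sizes (in elements) and overlaps for a 3-map network.
sizes = [100, 80, 40]
overlaps = [30, 10]
baseline = ping_pong_memory(sizes)
proposed = overlapped_memory(sizes, overlaps)
saving = 1 - proposed / baseline
```

With all overlaps set to zero, the second function reduces to the largest input-plus-output pair, so any saving in this sketch comes purely from the overlap values supplied.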
