Paper Title


AoCStream: All-on-Chip CNN Accelerator With Stream-Based Line-Buffer Architecture

Author

Kang, Hyeong-Ju

Abstract


Convolutional neural network (CNN) accelerators are widely used for their efficiency, but they require a large amount of memory, leading to the use of slow and power-consuming external memory. This paper exploits two schemes to reduce the required memory and ultimately to implement a CNN of reasonable performance using only the on-chip memory of a practical device such as a low-end FPGA. To reduce the memory for intermediate data, a stream-based line-buffer architecture and a dataflow for that architecture are proposed in place of the conventional frame-based architecture, in which the amount of intermediate data memory is proportional to the square of the input image size. The architecture consists of layer-dedicated blocks operating in a pipelined manner on the input and output streams. Each convolutional layer block has a line buffer storing just a few rows of input data. The sizes of the line buffers are proportional to the width of the input image, so the architecture requires less intermediate data storage than the conventional frame-based architecture, especially given the trend toward larger input sizes in modern object detection CNNs. In addition to the reduced intermediate data storage, the weight memory is reduced by accelerator-aware pruning. The experimental results show that a whole object detection CNN can be implemented even on a low-end FPGA without external memory. Compared to previous accelerators with similar object detection accuracy, the proposed accelerator reaches much higher throughput even with fewer FPGA resources (LUTs, registers, and DSPs), showing much higher efficiency. The trained models and implemented bit files are available at https://github.com/HyeongjuKang/accelerator-aware-pruning and https://github.com/HyeongjuKang/aocstream.
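The key memory argument above — a line buffer holds only a few rows, so storage grows with the image width rather than with the full frame — can be illustrated with a software sketch. This is not the paper's hardware design, only a minimal illustration of the idea: a 3×3 valid convolution over a row-major pixel stream that buffers at most three rows at any time (the function name, kernel size, and stream format are assumptions for illustration).

```python
from collections import deque

def line_buffer_conv3x3(pixel_stream, width, kernel):
    """Illustrative sketch (not the paper's RTL): a 3x3 valid convolution
    over a row-major pixel stream, buffering only the last three image
    rows, i.e. O(width) storage instead of O(width x height)."""
    rows = deque(maxlen=3)  # line buffer: at most 3 rows of `width` pixels
    current = []            # the row currently streaming in
    outputs = []
    for px in pixel_stream:
        current.append(px)
        if len(current) == width:      # a full row has arrived
            rows.append(current)       # oldest row is dropped automatically
            current = []
            if len(rows) == 3:         # enough rows for one output row
                out_row = []
                for c in range(width - 2):
                    acc = sum(kernel[i][j] * rows[i][c + j]
                              for i in range(3) for j in range(3))
                    out_row.append(acc)
                outputs.append(out_row)
    return outputs

# A 4x4 image of ones with an all-ones kernel yields a 2x2 output of 9s.
print(line_buffer_conv3x3([1] * 16, 4, [[1] * 3] * 3))
```

Note that the buffer never holds more than three rows: for an H×W input the working storage is 3·W pixels, which is why the abstract's architecture scales with image width while a frame-based design scales with W².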
