Paper Title
Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution
Paper Authors
Paper Abstract
Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for considered tasks with up to +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining the classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
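The core idea of the abstract, avoiding materialisation of full intermediate buffers by splitting a subgraph of channelwise operators along the channel dimension, can be illustrated with a minimal NumPy sketch. This is not Pex's implementation; `op1` and `op2` are hypothetical stand-ins for channelwise operators (e.g. a per-channel scale and a ReLU), and the fixed tensor shape is illustrative only.

```python
import numpy as np

C, H, W = 64, 32, 32
x = np.random.randn(C, H, W).astype(np.float32)

# Hypothetical channelwise operators: each output channel depends only on
# the corresponding input channel, so they can run on channel slices.
def op1(t):
    return t * 2.0            # stand-in for, e.g., a per-channel scale

def op2(t):
    return np.maximum(t, 0.0)  # stand-in for, e.g., ReLU

# Naive schedule: the intermediate op1(x) is fully materialised,
# so peak extra memory is a whole C x H x W buffer.
def naive_schedule(x):
    t = op1(x)
    return op2(t)

# Partial-execution schedule: process `split` channels at a time, so the
# intermediate buffer only ever holds a split x H x W slice.
def partial_schedule(x, split=8):
    y = np.empty_like(x)
    for c in range(0, x.shape[0], split):
        sl = slice(c, c + split)
        y[sl] = op2(op1(x[sl]))
    return y

# Both schedules compute the same result; only peak memory differs.
assert np.allclose(naive_schedule(x), partial_schedule(x))
```

In this toy chain, the intermediate footprint drops by the split factor (here 8x) with no extra arithmetic, mirroring the abstract's claim of reduced SRAM usage "with no computation overhead"; the paper's compiler automates finding such splittable subgraphs in real networks.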