Paper Title

Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization

Paper Authors

Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand

Paper Abstract

This paper presents an optimized methodology to design and deploy Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU) with 1+8 general-purpose RISC-V cores. To achieve low-latency execution, we propose an optimized software pipeline that interleaves the parallel computation of LSTM or GRU recurrent blocks, featuring vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16) compute units, with manually managed memory transfers of the model parameters. To ensure minimal accuracy degradation with respect to the full-precision models, we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ) scheme that compresses the recurrent layers to 8 bits while keeping the remaining layers in FP16. Experiments are conducted on multiple LSTM- and GRU-based SE models trained on the Valentini dataset, featuring up to 1.24M parameters. Thanks to the proposed approaches, we speed up the computation by up to 4x with respect to the lossless FP16 baselines. Unlike uniform 8-bit quantization, which degrades the PESQ score by 0.3 on average, the Mixed-Precision PTQ scheme leads to a degradation of only 0.06 while achieving a 1.4-1.7x memory saving. Thanks to this compression, we cut the power cost of the external memory by fitting the large models into the limited on-chip non-volatile memory, and we gain an MCU power saving of up to 2.5x by reducing the supply voltage from 0.8V to 0.65V while still meeting the real-time constraints. Our design is 10x more energy efficient than state-of-the-art SE solutions deployed on single-core MCUs, which use smaller models and quantization-aware training.
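To make the mixed-precision idea concrete, below is a minimal sketch of how such an FP16-INT8 PTQ pass might look: recurrent-layer weights are compressed to INT8 with symmetric per-tensor scaling, while all other layers stay in FP16. The layer-name matching, the symmetric per-tensor scheme, and the `mixed_precision_ptq` helper are illustrative assumptions for this sketch, not the authors' implementation; the abstract does not specify these details.

```python
import numpy as np

def quantize_tensor_int8(w):
    """Symmetric per-tensor INT8 quantization.

    Returns the int8 weight tensor and its scale, so that the
    original values are approximately recovered as q * scale.
    """
    scale = max(np.max(np.abs(w)), 1e-8) / 127.0  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def mixed_precision_ptq(layers):
    """Compress recurrent-layer weights to INT8; keep the rest in FP16.

    `layers` maps layer names to float32 weight arrays. The name-based
    dispatch ("lstm"/"gru" substrings) is a placeholder convention.
    """
    out = {}
    for name, w in layers.items():
        if "lstm" in name or "gru" in name:  # recurrent blocks -> INT8
            q, scale = quantize_tensor_int8(w)
            out[name] = {"weights": q, "scale": scale, "dtype": "int8"}
        else:  # e.g., fully-connected input/output layers stay FP16
            out[name] = {"weights": w.astype(np.float16), "dtype": "fp16"}
    return out
```

At inference time, the INT8 products are rescaled back to floating point via the stored per-tensor scale (standard dequantization), which is what lets the vectorized INT8 units handle the recurrent blocks while the FP16 layers preserve accuracy.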
