Paper Title
Towards Hardware-Specific Automatic Compression of Neural Networks
Paper Authors
Paper Abstract
Compressing neural network architectures is important to allow the deployment of models to embedded or mobile devices, and pruning and quantization are the predominant approaches for compressing neural networks today. Both methods benefit when the compression parameters are selected specifically for each layer. Finding good combinations of compression parameters, so-called compression policies, is hard because the problem spans an exponentially large search space. Effective compression policies account for the influence of the specific hardware architecture on the compression methods used. We propose an algorithmic framework called Galen that searches for such policies using reinforcement learning with pruning and quantization, thus providing automatic compression of neural networks. Contrary to other approaches, we use the inference latency measured on the target hardware device as the optimization goal. The framework thereby supports the compression of models specific to a given hardware target. We validate our approach using three different reinforcement learning agents: one for pruning, one for quantization, and one for joint pruning and quantization. Besides demonstrating the functionality of our approach, we were able to compress a ResNet18 for CIFAR-10 on an embedded ARM processor to 20% of the original inference latency without significant loss of accuracy. Moreover, we demonstrate that jointly searching and compressing with pruning and quantization is superior to searching for policies with a single compression method.
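The abstract does not state how the measured latency enters the search, so the following is only a minimal sketch of the general idea: a per-layer compression policy (pruning ratio and bit-width per layer) is scored by a reward that scales accuracy with the measured speedup. All names here (measure_latency, reward, the exponent beta, the layer keys) are illustrative assumptions, not Galen's actual API.

```python
import time
import torch

# A compression policy assigns per-layer parameters, e.g. a pruning ratio
# and a quantization bit-width (layer names are illustrative):
policy = {"conv1": {"prune_ratio": 0.3, "bits": 8},
          "layer1.0.conv1": {"prune_ratio": 0.5, "bits": 4}}

def measure_latency(model: torch.nn.Module, sample: torch.Tensor,
                    runs: int = 50) -> float:
    """Average wall-clock inference latency (seconds), measured directly
    on the target device rather than estimated from a proxy like FLOPs."""
    model.eval()
    with torch.no_grad():
        model(sample)  # warm-up run before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
    return (time.perf_counter() - start) / runs

def reward(accuracy: float, latency: float, base_latency: float,
           beta: float = 1.0) -> float:
    """Hypothetical reward: keep accuracy high while cutting measured
    latency; beta trades off the two objectives."""
    speedup = base_latency / latency
    return accuracy * speedup ** beta
```

Measuring latency on the device itself, as opposed to using hardware-agnostic proxies, is what makes the resulting policies specific to the target hardware.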