Paper Title
Fast DistilBERT on CPUs
Paper Authors
Paper Abstract
Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires maximum throughput while complying with certain latency constraints, which prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model that shows minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and report throughput results under typical production constraints and environments. Our results outperform the existing state-of-the-art Neural Magic DeepSparse runtime by up to 50% and deliver up to a 4.1x speedup over ONNX Runtime. Source code is publicly available at https://github.com/intel/intel-extension-for-transformers.
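As a rough illustration of the kind of compression step the abstract describes, the sketch below applies post-training dynamic quantization to an off-the-shelf DistilBERT SQuAD checkpoint using stock PyTorch and Hugging Face transformers. This is not the paper's actual pipeline, which additionally uses hardware-aware pruning, knowledge distillation, and a custom sparse/quantized inference runtime; the checkpoint name and the example question/context are assumptions for illustration only.

```python
# Minimal sketch (not the paper's pipeline): post-training dynamic quantization
# of a DistilBERT question-answering model with stock PyTorch.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Illustrative public checkpoint; the paper builds its own compressed model.
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

# Quantize all Linear layers to int8 weights; activations are quantized
# dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Run a sample SQuAD-style question/context pair through the quantized model.
question = "What does the pipeline optimize?"
context = ("The pipeline combines pruning, distillation, and quantization "
           "to speed up Transformer inference on CPUs.")
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)

# Decode the highest-scoring answer span.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax()) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```

Dynamic quantization alone typically trades a small amount of accuracy for lower latency on CPUs; the paper's reported gains come from combining it with sparsity-aware kernels and the other steps listed above.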