Paper Title
Compression of Generative Pre-trained Language Models via Quantization
Paper Authors
Paper Abstract
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and the underlying difficulty remains unclear. In this paper, we compress generative PLMs by quantization. We find that previous quantization methods fail on generative tasks due to the \textit{homogeneous word embeddings} caused by reduced capacity, and the \textit{varied distribution of weights}. Correspondingly, we propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules. Empirical results on various tasks show that our proposed method outperforms the state-of-the-art compression methods on generative PLMs by a clear margin. With performance comparable to the full-precision models, we achieve 14.4x and 13.4x compression rates on GPT-2 and BART, respectively.
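To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch, not the paper's reference implementation: a symmetric quantizer whose clipping scale is a learnable per-module parameter (an assumed reading of "module-wise dynamic scaling"), and a token-level contrastive distillation loss that pulls each student token representation toward the teacher's representation of the same token and pushes it away from other tokens in the sequence. All names (DynamicScaleQuantizer, token_level_contrastive_loss, temperature) and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of module-wise dynamic scaling + token-level contrastive
# distillation, assuming a symmetric uniform quantizer with a straight-through
# estimator. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicScaleQuantizer(nn.Module):
    """Symmetric uniform quantizer whose clipping scale is a learnable
    parameter, so each module (embedding, attention, FFN, ...) can adapt
    its own range instead of sharing one fixed heuristic."""

    def __init__(self, num_bits: int = 2, init_scale: float = 1.0):
        super().__init__()
        self.num_bits = num_bits
        self.scale = nn.Parameter(torch.tensor(init_scale))  # learned jointly with the weights

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        qmax = 2 ** (self.num_bits - 1) - 1
        # Clip to the learned range, quantize to the grid, then use a
        # straight-through estimator so gradients reach both weights and scale.
        w_clipped = torch.clamp(w, min=-self.scale, max=self.scale)
        step = self.scale / qmax
        w_q = torch.round(w_clipped / step) * step
        return w_clipped + (w_q - w_clipped).detach()


def token_level_contrastive_loss(student_h: torch.Tensor,
                                 teacher_h: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Token-level contrastive distillation for one sequence.

    student_h, teacher_h: [seq_len, hidden] hidden states. The positive pair
    is the same token position in student and teacher; other positions act as
    negatives, which discourages homogeneous (indistinguishable) embeddings.
    """
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    logits = s @ t.t() / temperature                      # [seq_len, seq_len] similarities
    targets = torch.arange(s.size(0), device=s.device)    # positive = same position
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    quantizer = DynamicScaleQuantizer(num_bits=2)
    w = torch.randn(768, 768)
    w_q = quantizer(w)                                    # 2-bit weights: values in {-scale, 0, +scale}
    loss = token_level_contrastive_loss(torch.randn(16, 768), torch.randn(16, 768))
    print(w_q.unique().numel(), loss.item())
```

In practice one such quantizer would be attached to each weight matrix, and the contrastive term would be added to the usual distillation and task losses during quantization-aware training; the exact loss weighting and bit-widths are design choices not specified by the abstract.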