Paper Title

Pre-trained Summarization Distillation

Paper Authors

Sam Shleifer, Alexander M. Rush

Paper Abstract

Recent state-of-the-art approaches to summarization utilize large pre-trained Transformer models. Distilling these models to smaller student models has become critically important for practical use; however, the NLP literature has proposed many different distillation methods. Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation. Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model. A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning. We compare these three approaches for distillation of Pegasus and BART, the current and former state-of-the-art pre-trained summarization models, and find that SFT outperforms knowledge distillation and pseudo-labeling on the CNN/DailyMail dataset, but under-performs pseudo-labeling on the more abstractive XSUM dataset. PyTorch code and checkpoints of different sizes are available through Hugging Face Transformers here: http://tiny.cc/4iy0tz.
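
As a rough illustration of the 'shrink and fine-tune' (SFT) idea described in the abstract, the sketch below copies a subset of decoder layers from a fine-tuned BART teacher into a shallower student using the Hugging Face transformers library. The checkpoint name (facebook/bart-large-cnn), the every-other-layer selection scheme, and the output directory are illustrative assumptions rather than the paper's exact recipe; after this step the student would still be fine-tuned on the summarization data as usual.

```python
# Minimal SFT sketch: build a smaller student by copying teacher parameters,
# then fine-tune it normally (fine-tuning loop not shown).
import copy

from transformers import BartForConditionalGeneration

teacher = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Student config is identical to the teacher's except for decoder depth.
# Keeping every other decoder layer is an illustrative choice, not the
# paper's prescribed layer-selection scheme.
layers_to_copy = [0, 2, 4, 6, 8, 10]
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = len(layers_to_copy)

student = BartForConditionalGeneration(student_config)

# Initialize the student from the teacher wherever shapes match
# (embeddings, encoder, LM head), ignoring the extra teacher decoder layers.
student.load_state_dict(teacher.state_dict(), strict=False)

# Overwrite the student's decoder layers with the selected teacher layers.
for student_idx, teacher_idx in enumerate(layers_to_copy):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )

student.save_pretrained("student-bart-12-6")  # hypothetical output path
```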
