Decoupled Knowledge Distillation

Paper Title

Decoupled Knowledge Distillation

Authors

Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, Jiajun Liang

Abstract

State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint to study logit distillation, we reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance these two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper proves the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.
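
The reformulation described in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' implementation (that lives in the mdistiller repository); it is a minimal single-sample NumPy illustration. It assumes the decomposition takes the form KD = TCKD + (1 - p_t) * NCKD, where p_t is the teacher's probability for the target class, and that the decoupled loss replaces the coupled weight with two free weights. The function names (`kd_parts`, `dkd_loss`) and the weights `alpha` and `beta` are hypothetical and chosen for illustration only, and any temperature-squared scaling of the loss is omitted.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kd_parts(student_logits, teacher_logits, target, T=1.0):
    """Return (kd, tckd, nckd, teacher_target_prob) for one sample."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)

    # TCKD: KL between the binary (target vs. all non-target) distributions.
    b_s = np.array([p_s[target], 1.0 - p_s[target]])
    b_t = np.array([p_t[target], 1.0 - p_t[target]])
    tckd = kl(b_t, b_s)

    # NCKD: KL between distributions renormalized over non-target classes.
    mask = np.arange(p_t.shape[0]) != target
    n_s = p_s[mask] / p_s[mask].sum()
    n_t = p_t[mask] / p_t[mask].sum()
    nckd = kl(n_t, n_s)

    kd = kl(p_t, p_s)            # classical KD loss (teacher || student)
    return kd, tckd, nckd, float(p_t[target])

def dkd_loss(student_logits, teacher_logits, target,
             alpha=1.0, beta=1.0, T=1.0):
    """Decoupled loss: independent weights on TCKD and NCKD.

    alpha and beta are illustrative hyperparameters, not tuned values.
    """
    _, tckd, nckd, _ = kd_parts(student_logits, teacher_logits, target, T)
    return alpha * tckd + beta * nckd

if __name__ == "__main__":
    student = [1.0, 0.2, -0.5, 0.3]
    teacher = [3.0, 0.5, -1.0, 0.1]
    kd, tckd, nckd, pt = kd_parts(student, teacher, target=0, T=2.0)
    # In classical KD the NCKD term is scaled by (1 - pt), so a confident
    # teacher (pt close to 1) shrinks its contribution.
    print(kd, tckd + (1.0 - pt) * nckd)
```

In this sketch the coupling is visible directly: the weight on NCKD in the classical loss is tied to the teacher's confidence, whereas `dkd_loss` lets the two terms be balanced independently.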
