Paper Title

Understanding and Improving Knowledge Distillation

Authors

Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, Sagar Jain

Abstract

Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is used to train a more compact student model with better inference efficiency. Through distillation, one hopes to benefit from student's compactness, without sacrificing too much on model quality. Despite the large success of knowledge distillation, better understanding of how it benefits student model's training dynamics remains under-explored. In this paper, we categorize teacher's knowledge into three hierarchical levels and study its effects on knowledge distillation: (1) knowledge of the `universe', where KD brings a regularization effect through label smoothing; (2) domain knowledge, where teacher injects class relationships prior to student's logit layer geometry; and (3) instance specific knowledge, where teacher rescales student model's per-instance gradients based on its measurement on the event difficulty. Using systematic analyses and extensive empirical studies on both synthetic and real-world datasets, we confirm that the aforementioned three factors play a major role in knowledge distillation. Furthermore, based on our findings, we diagnose some of the failure cases of applying KD from recent studies.
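
For readers who want the mechanics behind the abstract, below is a minimal sketch of the standard soft-target distillation objective (Hinton et al. style) that this kind of analysis builds on: the temperature-softened teacher distribution provides the label-smoothing-like regularization, encodes class relationships, and reweights per-instance gradients by how "easy" the teacher finds each example. The function name `kd_loss` and the hyperparameter values are illustrative assumptions, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Generic knowledge-distillation loss (sketch, not the paper's exact setup):
    a weighted sum of hard-label cross-entropy and the KL divergence between
    temperature-softened teacher and student distributions."""
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: the teacher's temperature-scaled probabilities act like
    # smoothed labels that also carry class-relationship information.
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor keeps gradient magnitudes comparable across temperatures.
    kl = F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)

    return alpha * ce + (1.0 - alpha) * kl
```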
