Paper Title
Knowledge Distillation: Bad Models Can Be Good Role Models
Paper Authors
Paper Abstract
Large neural networks trained in the overparameterized regime are able to fit noise to zero train error. Recent work \citep{nakkiran2020distributional} has empirically observed that such networks behave as "conditional samplers" from the noisy distribution. That is, they replicate the noise in the train data on unseen examples. We give a theoretical framework for studying this conditional sampling behavior in the context of learning theory. We relate the notion of such samplers to knowledge distillation, where a student network imitates the outputs of a teacher on unlabeled data. We show that samplers, while being bad classifiers, can be good teachers. Concretely, we prove that distillation from samplers is guaranteed to produce a student that approximates the Bayes optimal classifier. Finally, we show that some common learning algorithms (e.g., Nearest-Neighbours and Kernel Machines) can generate samplers when applied in the overparameterized regime.
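The sampler-as-teacher idea in the abstract can be illustrated with a toy sketch (this is an illustrative construction, not the paper's proof): an interpolating 1-nearest-neighbour "teacher" memorizes noisy labels and therefore replicates the label noise on fresh points (a bad classifier), yet a "student" that averages the teacher's outputs over many unlabeled points recovers the Bayes-optimal decision rule. All problem parameters here (the 1-D distribution, noise rate, and bin-averaging student) are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE = 0.3  # probability each training label is flipped

def bayes(x):
    # Bayes-optimal classifier for this toy problem: label 1 iff x > 0.5
    return (x > 0.5).astype(int)

# Noisy training set, memorized exactly by an interpolating 1-NN "teacher"
x_tr = rng.uniform(0, 1, 500)
y_tr = bayes(x_tr) ^ (rng.uniform(0, 1, 500) < NOISE)

def teacher(x):
    # 1-NN: return the (noisy) train label of the nearest training point,
    # so the teacher behaves like a conditional sampler from the noisy law
    return y_tr[np.abs(x_tr[None, :] - x[:, None]).argmin(axis=1)]

# The teacher is a bad classifier: its error tracks the noise rate
x_te = rng.uniform(0, 1, 4000)
teacher_err = np.mean(teacher(x_te) != bayes(x_te))

# Distillation: the student averages teacher outputs over unlabeled data
# (here, within 20 equal-width bins) and thresholds at 1/2, smoothing
# away the noise the teacher replicates
x_un = rng.uniform(0, 1, 10000)
t_out = teacher(x_un)
bins = np.linspace(0, 1, 21)
idx = np.digitize(x_un, bins) - 1
student_rule = np.array(
    [t_out[idx == b].mean() > 0.5 for b in range(20)]
).astype(int)

def student(x):
    return student_rule[np.clip(np.digitize(x, bins) - 1, 0, 19)]

student_err = np.mean(student(x_te) != bayes(x_te))
print(teacher_err, student_err)
```

Running this, the teacher's error sits near the noise rate (about 0.3) while the distilled student's error is near zero, matching the abstract's claim that samplers, though bad classifiers, can be good teachers.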