Paper Title
Towards Practical Lipreading with Distilled and Efficient Models
Paper Authors
Paper Abstract
Lipreading has witnessed a lot of progress due to the resurgence of neural networks. Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization. However, there is still a significant gap between current methodologies and the requirements for effective deployment of lipreading in practical scenarios. In this work, we propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance on LRW and LRW-1000 by a wide margin, to 88.5% and 46.6% respectively, using self-distillation. Secondly, we propose a series of architectural changes, including a novel Depthwise Separable Temporal Convolutional Network (DS-TCN) head, that slashes the computational cost to a fraction of the (already quite efficient) original model. Thirdly, we show that knowledge distillation is a very effective tool for recovering the performance of lightweight models. This results in a range of models with different accuracy-efficiency trade-offs. Notably, our most promising lightweight models are on par with the current state-of-the-art while showing reductions of 8.2x in computational cost and 3.9x in the number of parameters, which we hope will enable the deployment of lipreading models in practical applications.
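The efficiency gain of the DS-TCN head comes from factorizing each temporal convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution, which replaces the dense channel-by-kernel weight tensor with two much smaller ones. The abstract does not include implementation details, so the following PyTorch block is only a minimal sketch of one such depthwise separable temporal convolution; the channel count, kernel size, dilation, normalization, and activation are assumptions, not the authors' released configuration.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableTemporalConv(nn.Module):
    """Sketch of a depthwise separable 1D temporal convolution block:
    a per-channel (depthwise) temporal convolution followed by a
    1x1 (pointwise) convolution that mixes information across channels."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # keep the temporal length unchanged
        # Depthwise: groups=channels means each channel is convolved independently.
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=padding, dilation=dilation,
                                   groups=channels, bias=False)
        # Pointwise: 1x1 convolution mixes channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.act(self.norm(self.pointwise(self.depthwise(x))))


# Usage example with hypothetical sizes: 64 frames of 256-dimensional features.
block = DepthwiseSeparableTemporalConv(channels=256, kernel_size=3)
out = block(torch.randn(2, 256, 64))  # -> (2, 256, 64)
```

Compared with a standard temporal convolution, whose weight count scales with channels x channels x kernel size, the depthwise + pointwise factorization scales roughly with channels x kernel size plus channels x channels, which is the source of the reduced computational cost.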
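Both the self-distillation used to raise accuracy and the knowledge distillation used to recover the lightweight models' performance rely on training a student against a teacher's softened predictions (in self-distillation the teacher and student share the same architecture). As a minimal sketch, assuming the common temperature-softened KL plus cross-entropy formulation, such a loss could be written as below; the temperature and weighting values are placeholders, not hyperparameters reported in the abstract.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,  # placeholder value
                      alpha: float = 0.5) -> torch.Tensor:  # placeholder weighting
    """Soft-label distillation: KL divergence between temperature-softened
    teacher and student distributions, blended with cross-entropy on the
    ground-truth word labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce
```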