Paper Title

Fast Yet Effective Speech Emotion Recognition with Self-distillation

Paper Authors

Zhao Ren, Thanh Tam Nguyen, Yi Chang, Björn W. Schuller

Paper Abstract

Speech emotion recognition (SER) is the task of recognising human emotional states from speech. SER is extremely prevalent in helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality breaks through the limited ground truth of speech data and outperforms existing models' performance on an SER dataset; (2) executing a powerful model at different depths enables adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process, rather than training from scratch for self-distillation, leads to faster learning and state-of-the-art accuracy on data with small quantities of label information.
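To make the training objective described above concrete, the following is a minimal PyTorch sketch of self-distillation with per-depth exits: every shallow exit is trained on the ground-truth labels plus a soft target distilled from the deepest exit, while the whole backbone is fine-tuned jointly. This is an illustrative sketch under assumptions, not the authors' implementation: a toy feed-forward encoder stands in for the pre-trained speech backbone used in the paper, and the names (`SelfDistilSER`, `self_distillation_loss`), loss weight `alpha`, and `temperature` are hypothetical choices.

```python
# Minimal self-distillation sketch (assumption: a toy encoder replaces the
# pre-trained speech backbone; names and hyperparameters are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfDistilSER(nn.Module):
    """Backbone with one shallow classifier ("exit") after every block."""

    def __init__(self, feat_dim=128, num_blocks=4, num_emotions=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
             for _ in range(num_blocks)]
        )
        self.exits = nn.ModuleList(
            [nn.Linear(feat_dim, num_emotions) for _ in range(num_blocks)]
        )

    def forward(self, x):
        logits = []
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            logits.append(exit_head(x))
        return logits  # logits[-1] is the deepest exit, acting as the teacher


def self_distillation_loss(all_logits, labels, alpha=0.3, temperature=3.0):
    """Cross-entropy on every exit + KL from shallow exits to the deepest exit."""
    teacher = all_logits[-1]
    loss = F.cross_entropy(teacher, labels)
    soft_teacher = F.softmax(teacher.detach() / temperature, dim=-1)
    for student in all_logits[:-1]:
        ce = F.cross_entropy(student, labels)
        kl = F.kl_div(
            F.log_softmax(student / temperature, dim=-1),
            soft_teacher,
            reduction="batchmean",
        ) * temperature ** 2
        loss = loss + (1 - alpha) * ce + alpha * kl
    return loss


# Toy fine-tuning step on random features (stand-ins for acoustic embeddings).
model = SelfDistilSER()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
features = torch.randn(8, 128)        # batch of 8 utterance-level features
labels = torch.randint(0, 4, (8,))    # 4 emotion classes

optimiser.zero_grad()
loss = self_distillation_loss(model(features), labels)
loss.backward()
optimiser.step()

# At inference, a shallower exit can be read out instead of the deepest one
# (a full early-exit implementation would stop the forward pass there).
with torch.no_grad():
    early_prediction = model(features)[1].argmax(dim=-1)
```

Reading out a shallower exit at deployment time is what gives the adaptive accuracy-efficiency trade-off on edge devices mentioned in point (2) of the abstract.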
