Multilingual Speech Emotion Recognition With Multi-Gating Mechanism and Neural Architecture Search

Paper Title

Multilingual Speech Emotion Recognition With Multi-Gating Mechanism and Neural Architecture Search

Authors

Wang, Zihan; Meng, Qi; Lan, HaiFeng; Zhang, XinRui; Guo, KeHao; Gupta, Akshat

Abstract

Speech emotion recognition (SER) classifies audio into emotion categories such as Happy, Angry, Fear, Disgust and Neutral. While SER is a common application for popular languages, it remains a problem for low-resourced languages, i.e., languages with no pretrained speech-to-text recognition models. This paper first proposes a language-specific model that extracts emotional information from multiple pre-trained speech models, and then designs a multi-domain model that performs SER for several languages simultaneously. The multi-domain model employs a multi-gating mechanism to generate a unique weighted feature combination for each language, and searches for a specific neural network structure for each language through a neural architecture search module. In addition, we introduce a contrastive auxiliary loss to build more separable representations of audio data. Our experiments show that our model raises the state-of-the-art accuracy by 3% for German and 14.3% for French.
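
The idea of a per-language weighted feature combination can be illustrated with a small sketch. This is not the authors' implementation: the feature vectors, the language keys, and the gate logits below are hypothetical placeholders, and the gate is simplified to one fixed logit vector per language rather than a learned network. It only shows how softmax gate weights could mix features from several pre-trained speech models differently for each language.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_combination(features, gate_logits):
    """Return (weights, combined), where combined is the softmax-weighted
    sum of the per-model feature vectors."""
    weights = softmax(gate_logits)
    dim = len(features[0])
    combined = [0.0] * dim
    for w, vec in zip(weights, features):
        for i in range(dim):
            combined[i] += w * vec[i]
    return weights, combined


# Hypothetical feature vectors from three pre-trained speech models
# (assumption: all models are projected to the same dimension).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical gate logits, one vector per language. In a trained model
# these would be learned; here they are hand-picked for illustration.
gate_logits = {"German": [2.0, 0.0, 0.0], "French": [0.0, 0.0, 2.0]}

for language, logits in gate_logits.items():
    weights, combined = gated_combination(features, logits)
    print(language, [round(w, 3) for w in weights], [round(c, 3) for c in combined])
```

Because each language has its own gate, the same set of extracted features yields a different mixture per language, which is the property the abstract attributes to the multi-gating mechanism.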

