Paper title
Learning Speech Emotion Representations in the Quaternion Domain
Paper authors
Paper abstract
The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models and the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier makes it possible to optimize each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimension to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: Iemocap, Ravdess, EmoDb and Tess, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) with that of their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in test accuracy on all datasets, while drastically reducing the resource demand of the models. Moreover, we perform additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo.
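For readers unfamiliar with quaternion-valued networks, the following minimal sketch shows the Hamilton product, the standard operation by which a quaternion weight mixes the four components of a quaternion input. It is an illustration only, not code from the RH-emo repository. The function name `hamilton_product`, the example values, and the assignment of the four emotion-related axes to the components (r, i, j, k) are all assumptions made here for clarity.

```python
from typing import Tuple

# A quaternion is stored as its four real components (r, i, j, k).
Quaternion = Tuple[float, float, float, float]


def hamilton_product(w: Quaternion, x: Quaternion) -> Quaternion:
    """Return the Hamilton product w * x of two quaternions."""
    wr, wi, wj, wk = w
    xr, xi, xj, xk = x
    return (
        wr * xr - wi * xi - wj * xj - wk * xk,  # real part
        wr * xi + wi * xr + wj * xk - wk * xj,  # i part
        wr * xj - wi * xk + wj * xr + wk * xi,  # j part
        wr * xk + wi * xj - wj * xi + wk * xr,  # k part
    )


# Hypothetical embedding value: one number per emotion-related axis.
# The ordering (overall emotion, valence, arousal, dominance) is an
# assumption for this example, not taken from the paper.
embedding: Quaternion = (0.5, -0.2, 0.8, 0.1)

# A single quaternion weight has 4 real parameters, yet every output
# component depends on all 4 input components. A general real-valued
# 4x4 mixing of the same components would need 16 parameters.
weight: Quaternion = (1.0, 0.0, 0.0, 0.0)  # identity quaternion
mixed = hamilton_product(weight, embedding)
```

Because the four weight components are reused across all four output components, a quaternion layer uses a quarter of the parameters of a real-valued layer of the same input and output size. This weight sharing is the general mechanism by which quaternion-valued networks can lower resource demand.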