Paper Title
On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition
Paper Authors
Paper Abstract
Pre-training for feature extraction is an increasingly studied approach to get better continuous representations of audio and text content. In the present work, we use wav2vec and camemBERT as self-supervised learned models to represent our data in order to perform continuous emotion recognition from speech (SER) on AlloSat, a large French emotional database describing the satisfaction dimension, and on the state-of-the-art corpus SEWA focusing on the valence, arousal and liking dimensions. To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is very relevant for the continuous SER task, which is usually characterized by a small amount of labeled training data. Evaluated by the well-known concordance correlation coefficient (CCC), our experiments show that we can reach a CCC value of 0.825 instead of 0.592 when using MFCC in conjunction with word2vec word embeddings on the AlloSat dataset.
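
The abstract reports performance with the concordance correlation coefficient (CCC). As a reference for those numbers, here is a minimal NumPy sketch of the standard CCC formula; the function name and the use of population statistics are illustrative assumptions and are not taken from the paper's code.

    import numpy as np

    def concordance_correlation_coefficient(y_true, y_pred):
        # CCC between gold and predicted continuous annotations
        # (e.g. satisfaction or valence traces over time).
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        mean_t, mean_p = y_true.mean(), y_pred.mean()
        var_t, var_p = y_true.var(), y_pred.var()              # population variances
        cov = np.mean((y_true - mean_t) * (y_pred - mean_p))   # population covariance
        return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

A CCC of 1 indicates perfect agreement in both correlation and scale, so the reported values of 0.825 and 0.592 are this measure computed between predicted and gold continuous emotion annotations.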