A New Amharic Speech Emotion Dataset and Classification Benchmark

Paper Title

A New Amharic Speech Emotion Dataset and Classification Benchmark

Authors

Retta, Ephrem A.; Almekhlafi, Eiad; Sutcliffe, Richard; Mhamed, Mustafa; Ali, Haider; Feng, Jun

Abstract

In this paper we present the Amharic Speech Emotion Dataset (ASED), which covers four dialects (Gojjam, Wollo, Shewa and Gonder) and five emotions (neutral, fearful, happy, sad and angry). We believe it is the first Speech Emotion Recognition (SER) dataset for the Amharic language. Sixty-five volunteer participants, all native speakers, recorded 2,474 sound samples, each two to four seconds long. Eight judges assigned emotions to the samples with a high level of agreement (Fleiss' kappa = 0.8). The resulting dataset is freely available for download. Next, we developed a four-layer variant of the well-known VGG model, which we call VGGb. Three experiments were then carried out on ASED using VGGb for SER. First, we investigated whether Mel-spectrogram features or Mel-frequency cepstral coefficient (MFCC) features work best for Amharic. This was done by training two VGGb SER models on ASED, one using Mel-spectrograms and the other using MFCCs. Four forms of training were tried: standard cross-validation, and three variants based on sentences, dialects and speaker groups. In the respective variant, a sentence, dialect or speaker group used for training is not used for testing. The conclusion was that MFCC features are superior under all four training schemes. MFCC was therefore adopted for Experiment 2, in which VGGb was compared on ASED with three existing models: ResNet50, AlexNet and LSTM. VGGb was found to have very good accuracy (90.73%) as well as the fastest training time. In Experiment 3, the performance of VGGb was compared when trained on two existing SER datasets, RAVDESS (English) and EMO-DB (German), as well as on ASED (Amharic). Results are comparable across these languages, with ASED being the highest. This suggests that VGGb can be successfully applied to other languages. We hope that ASED will encourage researchers to experiment with other models for Amharic SER.
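The sentence-, dialect- and speaker-group-based training schemes mentioned in the abstract amount to group-wise cross-validation: every sample sharing a group label lands on the same side of a split. The following is a minimal sketch of that idea, not the authors' code. The function name `group_folds`, the round-robin assignment of groups to folds, the fold count and the toy labels are all assumptions made for illustration; the abstract does not state how the folds were built.

```python
from collections import defaultdict


def group_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs so that no group is on both sides.

    groups[i] is the group label of sample i (for example a sentence ID,
    a dialect, or a speaker group). Groups are dealt to folds round-robin,
    largest first, as a simple heuristic to keep fold sizes similar.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    if n_folds > len(by_group):
        raise ValueError("more folds than distinct groups")

    # Largest groups first; ties broken by label so the result is deterministic.
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    folds = [[] for _ in range(n_folds)]
    for pos, label in enumerate(ordered):
        folds[pos % n_folds].extend(by_group[label])

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# Hypothetical toy data: 8 clips tagged with made-up speaker-group IDs.
speaker_group = ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"]
splits = list(group_folds(speaker_group, n_folds=4))
```

Passing sentence IDs or dialect names instead of speaker-group IDs gives the other two variants. With standard cross-validation, by contrast, clips from the same speaker or sentence can appear in both training and test sets, which is why group-wise schemes are usually the stricter test.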
