Paper Title
MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis
Paper Authors
Paper Abstract
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules, namely a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, the utterance-level emotion variation, and the syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows us to synthesize emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer speech synthesis and text-based emotion prediction speech synthesis, respectively. Besides, the experiments also show that the proposed method can control emotion expression flexibly. Detailed analysis shows the effectiveness of each module and the good design of the proposed method.
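
The abstract does not specify how the three emotion representations condition the acoustic model, so the following is only a minimal sketch, assuming a simple fusion scheme: a global emotion embedding (GM), an utterance-level emotion vector (UM, e.g. extracted from reference audio or predicted from text), and per-syllable strength scalars (LM) are combined with the phoneme-level encoder outputs of a sequence-to-sequence model. All module names, dimensions, and the fusion itself are illustrative assumptions, not the authors' implementation.

# A minimal sketch (not the authors' released code) of multi-scale
# emotion conditioning as described in the abstract. Names, sizes, and
# the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class MultiScaleEmotionConditioner(nn.Module):
    """Fuses global-, utterance-, and local-level emotion information
    into phoneme-level encoder outputs (hypothetical design)."""

    def __init__(self, enc_dim=256, num_emotions=5, emo_dim=64):
        super().__init__()
        # GM: one embedding per global emotion category.
        self.global_emb = nn.Embedding(num_emotions, emo_dim)
        # UM: projects an utterance-level emotion vector, e.g. one
        # extracted from reference audio or predicted from input text.
        self.utter_proj = nn.Linear(emo_dim, emo_dim)
        # Fuses encoder states with the scaled emotion conditioning.
        self.fuse = nn.Linear(enc_dim + 2 * emo_dim, enc_dim)

    def forward(self, enc_out, emotion_id, utter_emb, local_strength):
        # enc_out:        (B, T, enc_dim)  phoneme-level encoder states
        # emotion_id:     (B,)             global emotion category index
        # utter_emb:      (B, emo_dim)     utterance-level emotion vector
        # local_strength: (B, T)           per-position strength in [0, 1]
        B, T, _ = enc_out.shape
        g = self.global_emb(emotion_id).unsqueeze(1).expand(B, T, -1)
        u = self.utter_proj(utter_emb).unsqueeze(1).expand(B, T, -1)
        # LM: scale the emotion conditioning by the local strength, which
        # can be transferred, predicted from text, or set manually.
        s = local_strength.unsqueeze(-1)
        cond = torch.cat([enc_out, s * g, s * u], dim=-1)
        return self.fuse(cond)


if __name__ == "__main__":
    cond = MultiScaleEmotionConditioner()
    enc = torch.randn(2, 10, 256)              # batch of 2, 10 phonemes
    out = cond(enc, torch.tensor([1, 3]),      # global emotion IDs
               torch.randn(2, 64),             # utterance-level vectors
               torch.rand(2, 10))              # local strengths
    print(out.shape)  # torch.Size([2, 10, 256])

Setting local_strength manually (rather than transferring or predicting it) corresponds to the manual emotion strength control mode mentioned in the abstract.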