Paper title
Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations
Authors
Abstract
Recent text-to-speech models can generate natural speech comparable to human speech, but they remain limited in expressiveness. Existing emotional speech synthesis models achieve controllability by interpolating features with scaling parameters in an emotional latent space. However, the latent space produced by these models makes it difficult to control emotional intensity continuously, because features such as emotion and speaker identity are entangled. In this paper, we propose a novel method for controlling the continuous intensity of emotions using semi-supervised learning. The model learns emotions of intermediate intensity from pseudo-labels generated from phoneme-level sequences of speech information. The embedding space built by the proposed model satisfies a uniform grid geometry with an emotional basis. Experimental results show that the proposed method is superior in controllability and naturalness.