Emotional Voice Conversion With Cycle-consistent Adversarial Network

Paper title

Emotional Voice Conversion With Cycle-consistent Adversarial Network

Authors

Songxiang Liu, Yuewen Cao, Helen Meng

Abstract

Emotional voice conversion (emotional VC) is a technique for converting speech from one emotional state to another while preserving the underlying linguistic information and speaker identity. Previous approaches to emotional VC require parallel data and use dynamic time warping (DTW) to temporally align the source and target speech parameters. These approaches often define a minimum generation loss, such as an L1 or L2 loss, as the objective function for learning model parameters. Recently, cycle-consistent generative adversarial networks (CycleGAN) have been used successfully for non-parallel VC. This paper investigates the efficacy of CycleGAN for emotional VC tasks. Rather than learning a mapping between parallel training data with a frame-to-frame minimum generation loss, the CycleGAN uses two discriminators and one classifier to guide the learning process: the discriminators aim to distinguish natural from converted speech, and the classifier aims to identify the underlying emotion of both natural and converted speech. The training process randomly pairs source and target speech parameters, without any temporal alignment operation. Objective and subjective evaluation results confirm the effectiveness of CycleGAN models for emotional VC. Because the CycleGAN is trained without parallel data, it shows potential for non-parallel emotional VC.
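
The two ideas the abstract relies on, random source-target pairing without alignment and a cycle-consistency constraint, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formulas or network details, so the L1 cycle loss below is the generic CycleGAN formulation, and the helper names (`cycle_consistency_loss`, `random_pairs`), the toy offset "generators", and the `neutral`/`angry` feature lists are all hypothetical. The adversarial and classifier terms are left out.

```python
import random


def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Generic L1 cycle loss: converting x to the target domain and back
    should recover x, and likewise for y in the other direction."""
    forward = l1_distance(g_yx(g_xy(x)), x)
    backward = l1_distance(g_xy(g_yx(y)), y)
    return forward + backward


def random_pairs(source, target, n_pairs, seed=0):
    """Draw source and target items independently of each other.
    No parallel utterances and no DTW alignment are assumed."""
    rng = random.Random(seed)
    return [(rng.choice(source), rng.choice(target)) for _ in range(n_pairs)]


# Hypothetical stand-ins for trained generators: a fixed feature offset.
def to_target(v):
    return [f + 1.0 for f in v]


def to_source(v):
    return [f - 1.0 for f in v]


# Hypothetical toy feature vectors for two emotion domains (unequal sizes).
neutral = [[0.0, 1.0], [2.0, 3.0]]
angry = [[1.5, 2.5], [4.0, 5.0], [6.0, 7.0]]

pairs = random_pairs(neutral, angry, n_pairs=4)
losses = [cycle_consistency_loss(x, y, to_target, to_source) for x, y in pairs]
```

Because the two toy generators are exact inverses, every cycle loss here is zero. In a real system the generators would be neural networks, and this term would be minimized together with the discriminator and classifier objectives that the abstract mentions.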
