Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Authors

Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, Rob Clark

Abstract

Transfer tasks in text-to-speech (TTS) synthesis, where one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not originally feature these aspects, remain challenging. One of the challenges is that models with high-quality transfer capabilities can have stability issues, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; in particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that models trained on synthetic data in this way can produce high-quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.
