Paper Title

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Paper Authors

Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro

Paper Abstract

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pre-trained models will be made publicly available at https://github.com/NVIDIA/flowtron
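The abstract's core mechanism, an invertible mapping between mel-spectrogram frames and a Gaussian latent space, trained by maximizing exact likelihood, can be illustrated with one affine autoregressive flow step. The sketch below is illustrative only: the function names are hypothetical, and in Flowtron the shift and log-scale would be produced by an autoregressive network conditioned on text and previous frames, not passed in directly.

```python
import numpy as np

def affine_flow_forward(x, shift, log_scale):
    """Map data x to latent z with an invertible affine transform.
    In a real flow, shift and log_scale come from an autoregressive
    network; here they are given constants for illustration."""
    z = (x - shift) * np.exp(-log_scale)
    # Exact log|det J| of the x -> z transform: sum of -log_scale.
    log_det = -np.sum(log_scale)
    return z, log_det

def affine_flow_inverse(z, shift, log_scale):
    """Invert the mapping: sample z ~ N(0, I) and recover a data frame x.
    Manipulating z here is what enables control over speech variation."""
    return z * np.exp(log_scale) + shift

def log_likelihood(x, shift, log_scale):
    """Exact log p(x) under a standard-normal prior on z; training
    maximizes this quantity, which is why optimization is stable."""
    z, log_det = affine_flow_forward(x, shift, log_scale)
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    return log_prior + log_det
```

Because the transform is exactly invertible, no adversarial training or variational bound is needed: the change-of-variables formula gives the likelihood in closed form, and sampling is just the inverse pass applied to Gaussian noise.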
