Paper Title

OverFlow: Putting flows on top of neural transducers for better TTS

Paper Authors

Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

Paper Abstract

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.
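The central idea in the abstract is to score highly non-Gaussian acoustic frames by passing them through an invertible normalising flow and evaluating a simple Gaussian (neural-HMM state) density in the latent space, so that the exact likelihood follows from the change-of-variables formula. The sketch below is not the paper's implementation; it is a minimal, self-contained illustration under assumed names (AffineCoupling, flow_log_likelihood, and the per-state Gaussian parameters are hypothetical stand-ins) of how such an exact log-likelihood can be computed.

```python
# Minimal sketch (not the OverFlow codebase): an invertible flow maps frames x
# to a latent z scored by a diagonal Gaussian, giving
#   log p(x) = log p_base(z) + log|det dz/dx|  exactly.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: invertible, with a cheap log-determinant."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        xa, xb = x[..., :self.half], x[..., self.half:]
        log_s, t = self.net(xa).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep scales numerically well-behaved
        zb = xb * log_s.exp() + t            # transform one half conditioned on the other
        z = torch.cat([xa, zb], dim=-1)
        logdet = log_s.sum(dim=-1)           # log|det Jacobian| of this layer
        return z, logdet

def flow_log_likelihood(x, flows, state_mean, state_logstd):
    """Exact log p(x) per frame under flow layers + a diagonal-Gaussian state model."""
    z, total_logdet = x, torch.zeros(x.shape[:-1])
    for f in flows:
        z, logdet = f(z)
        total_logdet = total_logdet + logdet
    base = torch.distributions.Normal(state_mean, state_logstd.exp())
    log_pz = base.log_prob(z).sum(dim=-1)    # base log-density of the latent frame
    return log_pz + total_logdet             # change-of-variables correction

# Toy usage: 80-dim mel frames, two coupling layers, one hypothetical state's Gaussian.
frames = torch.randn(4, 80)
flows = nn.ModuleList([AffineCoupling(80), AffineCoupling(80)])
mean, logstd = torch.zeros(80), torch.zeros(80)
print(flow_log_likelihood(frames, flows, mean, logstd).shape)  # torch.Size([4])
```

Because every layer is invertible with a tractable Jacobian, the same stack can be run in reverse at synthesis time to turn Gaussian samples from the duration/acoustic model into speech frames, which is what makes exact maximum-likelihood training of the whole model possible.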
