JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

Paper Title

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

Authors

Dan Lim, Sunghee Jung, Eesung Kim

Abstract

In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text into a mel-spectrogram, and HiFi-GAN then generates a raw waveform from that mel-spectrogram; the two are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model jointly trains FastSpeech2 and HiFi-GAN together with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment learning objective in our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations in ESPNet2-TTS on subjective evaluation (MOS) and some objective evaluations.
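
The feature-mismatch point in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `acoustic_model`, `vocoder`, and the three `*_input` helpers are hypothetical stand-ins, and the numbers are arbitrary. The sketch only shows which features the vocoder consumes in each setup, assuming a cascade whose vocoder is trained on ground-truth mel-spectrograms.

```python
from typing import List

Mel = List[float]   # toy "mel-spectrogram": one number per frame
Wave = List[float]  # toy "waveform": a list of samples


def acoustic_model(text: str) -> Mel:
    """Hypothetical stand-in for FastSpeech2: one toy frame per character."""
    return [float(ord(c) % 7) for c in text]


def vocoder(features: Mel) -> Wave:
    """Hypothetical stand-in for HiFi-GAN: repeat each frame as two samples."""
    return [f / 10.0 for f in features for _ in range(2)]


def cascade_training_input(ground_truth_mel: Mel) -> Mel:
    # Assumed two-stage setup: the vocoder is trained on ground-truth features.
    return ground_truth_mel


def joint_training_input(text: str) -> Mel:
    # Joint setup: the vocoder is trained on the acoustic model's own output.
    return acoustic_model(text)


def inference_input(text: str) -> Mel:
    # At inference, both setups feed the vocoder with predicted features.
    return acoustic_model(text)


text = "hi"
ground_truth_mel = [5.5, 0.5]  # arbitrary toy values standing in for recorded speech

# Cascade: training features differ from inference features (the mismatch).
cascade_mismatch = cascade_training_input(ground_truth_mel) != inference_input(text)
# Joint: training and inference features come from the same source.
joint_mismatch = joint_training_input(text) != inference_input(text)

print("cascade mismatch:", cascade_mismatch)
print("joint mismatch:", joint_mismatch)
print("toy waveform length:", len(vocoder(inference_input(text))))
```

In this toy setup the cascade vocoder sees different features at training and inference time, which is the gap that fine-tuning is normally used to close; the joint setup has no such gap by construction.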
