Paper Title
DiscreTalk: Text-to-Speech as a Machine Translation Problem
Paper Authors
Paper Abstract
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT). The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model learns a mapping function from a speech waveform into a sequence of discrete symbols, and then the Transformer-NMT model is trained to estimate this discrete symbol sequence from a given input text. Since the VQ-VAE model can learn such a mapping in a fully data-driven manner, we do not need to consider hyperparameters of the feature extraction required in the conventional E2E-TTS models. Thanks to the use of discrete symbols, we can use various techniques developed in NMT and automatic speech recognition (ASR), such as beam search, subword units, and fusion with a language model. Furthermore, we can avoid the over-smoothing problem of predicted features, which is one of the common issues in TTS. The experimental evaluation with the JSUT corpus shows that the proposed method outperforms the conventional Transformer-TTS model with a non-autoregressive neural vocoder in naturalness, achieving performance comparable to the reconstruction of the VQ-VAE model.
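To make the two-stage pipeline described above more concrete, the following is a minimal PyTorch sketch of the idea: a vector quantizer that turns continuous speech-encoder frames into discrete symbol indices, and an autoregressive Transformer that predicts that symbol sequence from text. All module names, dimensions, and the toy data are illustrative assumptions for exposition, not the authors' implementation or hyperparameters.

```python
# Minimal sketch of the DiscreTalk-style two-stage pipeline (illustrative only).
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Maps continuous encoder frames to indices of the nearest codebook vector."""

    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (batch, frames, dim)
        # Squared Euclidean distance from each frame to every codebook entry.
        dist = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        indices = dist.argmin(dim=-1)                        # discrete symbol sequence
        quantized = self.codebook(indices)                   # re-embedded symbols
        return indices, quantized


class TextToSymbolTransformer(nn.Module):
    """Autoregressive Transformer that predicts discrete speech symbols from text."""

    def __init__(self, vocab_size=100, num_codes=256, dim=64):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.sym_emb = nn.Embedding(num_codes, dim)
        self.transformer = nn.Transformer(
            d_model=dim, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.out = nn.Linear(dim, num_codes)

    def forward(self, text_ids, prev_symbols):
        # Causal mask so each position only attends to earlier symbols;
        # in real training the target would be shifted right (teacher forcing).
        mask = self.transformer.generate_square_subsequent_mask(prev_symbols.size(1))
        hidden = self.transformer(
            self.text_emb(text_ids), self.sym_emb(prev_symbols), tgt_mask=mask,
        )
        return self.out(hidden)                              # logits over the VQ codebook


# Toy usage: one utterance, 50 encoder frames, 20 input text tokens.
vq = VectorQuantizer()
nmt = TextToSymbolTransformer()
symbols, _ = vq(torch.randn(1, 50, 64))                      # stage 1: speech -> symbols
logits = nmt(torch.randint(0, 100, (1, 20)), symbols)        # stage 2: text -> symbol logits
print(symbols.shape, logits.shape)                           # (1, 50), (1, 50, 256)
```

Because the second stage is an ordinary sequence-to-sequence model over discrete tokens, decoding can reuse standard NMT/ASR machinery (beam search, subword units, language-model fusion) as the abstract notes; the predicted symbol sequence would then be passed to a vocoder to synthesize the waveform.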