Paper Title

Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE

Paper Authors

Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

Paper Abstract

In this paper, we propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs. UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning; it is developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity), and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we employ our recently formulated Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE) as the backbone UTTS AM, which offers well-structured content representations given unsupervised alignment (UA) as the condition during training. For UTTS inference, we utilize a lexicon to map input text to a phoneme sequence, which is expanded to a frame-level forced alignment (FA) with a speaker-dependent duration model. Then, we develop an alignment mapping module that converts FA to UA. Finally, the C-DSVAE, serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to a waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus in the AM development stage. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility, as measured by human and objective evaluations. Audio samples are available at our demo page https://neurtts.github.io/utts_demo/.
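To make the inference flow in the abstract concrete, below is a minimal Python sketch of the UTTS pipeline. All module names and interfaces (lexicon, duration_model, aligner, cdsvae, vocoder and their methods) are hypothetical placeholders chosen for illustration; this is a reading aid for the abstract, not the authors' actual implementation.

```python
# Hypothetical sketch of the UTTS inference pipeline described in the
# abstract. Every module interface below is an illustrative placeholder,
# not the authors' actual API.

def utts_infer(text, lexicon, duration_model, aligner, cdsvae, vocoder,
               speaker_embedding):
    # 1. Front-end: map the input text to a phoneme sequence via the lexicon.
    phonemes = lexicon.to_phonemes(text)

    # 2. Expand phonemes to a frame-level forced alignment (FA) using the
    #    speaker-dependent duration model (durations in frames per phoneme).
    durations = duration_model.predict(phonemes, speaker_embedding)
    forced_alignment = [p for p, d in zip(phonemes, durations)
                        for _ in range(d)]

    # 3. Alignment mapping module: convert FA into the unsupervised
    #    alignment (UA) that conditioned C-DSVAE training.
    unsupervised_alignment = aligner.fa_to_ua(forced_alignment)

    # 4. The C-DSVAE (the self-supervised TTS AM) generates a mel
    #    spectrogram from the predicted UA and the target speaker embedding.
    mel = cdsvae.generate(unsupervised_alignment, speaker_embedding)

    # 5. A neural vocoder converts the mel spectrogram to a waveform.
    return vocoder.to_waveform(mel)
```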
