Paper Title

Self-supervised learning for robust voice cloning

Paper Authors

Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

Paper Abstract

Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.
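The core ingredient named in the abstract is the BYOL objective: an online network is trained to predict the representation that a slowly updated (EMA) target network produces for a differently augmented view of the same utterance, so no negative pairs or speaker labels are needed. The PyTorch sketch below illustrates that objective on mel-spectrogram inputs. The toy encoder, projector sizes, EMA decay, and the stand-in noise/gain augmentations are assumptions for illustration only; they are not the architecture or the audio augmentations used in the paper, and the resulting utterance-level embedding would then be fed to a TTS model such as the Non-Attentive Tacotron variant the authors describe.

```python
# Minimal sketch of a BYOL-style objective on utterance-level speech features.
# All module sizes, the toy encoder, and the "augmentations" are illustrative
# placeholders, not the authors' configuration.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, hidden_dim, out_dim):
    """Projector / predictor head, as in standard BYOL."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )


class SpeechEncoder(nn.Module):
    """Toy encoder: mean-pools a mel-spectrogram into one utterance-level vector."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, mel):                 # mel: (batch, frames, n_mels)
        return self.net(mel).mean(dim=1)    # (batch, dim)


class BYOL(nn.Module):
    def __init__(self, dim=256, proj_dim=128, ema_decay=0.99):
        super().__init__()
        self.online_encoder = SpeechEncoder(dim=dim)
        self.online_projector = mlp(dim, 512, proj_dim)
        self.predictor = mlp(proj_dim, 512, proj_dim)
        # Target network is an EMA copy of the online network (no gradients).
        self.target_encoder = copy.deepcopy(self.online_encoder)
        self.target_projector = copy.deepcopy(self.online_projector)
        for p in list(self.target_encoder.parameters()) + list(self.target_projector.parameters()):
            p.requires_grad = False
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target(self):
        """EMA update of the target network, called after each optimizer step."""
        for online, target in [(self.online_encoder, self.target_encoder),
                               (self.online_projector, self.target_projector)]:
            for po, pt in zip(online.parameters(), target.parameters()):
                pt.data.mul_(self.ema_decay).add_(po.data, alpha=1 - self.ema_decay)

    def forward(self, view_a, view_b):
        """Symmetrized BYOL loss between two augmented views of the same utterance."""
        def loss_one_way(online_view, target_view):
            pred = self.predictor(self.online_projector(self.online_encoder(online_view)))
            with torch.no_grad():
                targ = self.target_projector(self.target_encoder(target_view))
            # 2 - 2*cos(pred, targ) equals the MSE between L2-normalized vectors.
            return 2 - 2 * F.cosine_similarity(pred, targ, dim=-1).mean()
        return loss_one_way(view_a, view_b) + loss_one_way(view_b, view_a)


if __name__ == "__main__":
    model = BYOL()
    mel = torch.randn(8, 200, 80)                            # batch of mel-spectrograms
    # Stand-in augmentations; the paper's actual audio augmentations differ.
    view_a = mel + 0.05 * torch.randn_like(mel)              # additive noise
    view_b = mel * torch.empty(8, 1, 1).uniform_(0.8, 1.2)   # random gain
    loss = model(view_a, view_b)
    loss.backward()
    model.update_target()
    print(f"BYOL loss: {loss.item():.4f}")
```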
