Paper Title

Can we use Common Voice to train a Multi-Speaker TTS system?

Paper Authors

Sewade Ogun, Vincent Colotte, Emmanuel Vincent

Paper Abstract

Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WV-MOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS points with respect to training on all the samples and by 0.35 MOS points with respect to training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages.
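
The core of the proposed pipeline is a filtering step: each Common Voice clip is scored with the WV-MOS estimator, and only clips above a quality threshold are kept for TTS training. Below is a minimal sketch of such a filter, assuming the open-source wvmos package (whose README exposes get_wvmos() and calculate_one()); the 4.0 threshold, file paths, and TSV layout are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: filter a Common Voice-style metadata file by WV-MOS score.
# Assumptions (not taken from the paper): the `wvmos` package from
# https://github.com/AndreevP/wvmos provides get_wvmos() / calculate_one(),
# the metadata is a TSV with a "path" column (as in Common Voice), and
# MOS_THRESHOLD = 4.0 is an illustrative cut-off, not the paper's setting.
import csv
from pathlib import Path

from wvmos import get_wvmos  # assumed import, per the wvmos README

MOS_THRESHOLD = 4.0                                   # hypothetical quality threshold
CLIPS_DIR = Path("cv-corpus/en/clips")                # hypothetical Common Voice layout
METADATA_IN = Path("cv-corpus/en/validated.tsv")
METADATA_OUT = Path("cv-corpus/en/validated_wvmos_filtered.tsv")


def filter_by_wvmos() -> None:
    # Load the pretrained wav2vec2-based MOS estimator (GPU if available).
    model = get_wvmos(cuda=True)
    with METADATA_IN.open(newline="", encoding="utf-8") as f_in, \
         METADATA_OUT.open("w", newline="", encoding="utf-8") as f_out:
        reader = csv.DictReader(f_in, delimiter="\t")
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        kept = total = 0
        for row in reader:
            total += 1
            # Score one clip and keep it only if its estimated MOS clears the threshold.
            score = model.calculate_one(str(CLIPS_DIR / row["path"]))
            if score >= MOS_THRESHOLD:
                writer.writerow(row)
                kept += 1
        print(f"Kept {kept}/{total} clips with WV-MOS >= {MOS_THRESHOLD}")


if __name__ == "__main__":
    filter_by_wvmos()
```

The resulting filtered metadata file can then be fed to a multi-speaker TTS training recipe (e.g. GlowTTS) in place of the full, unfiltered dataset.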
