Non-Parallel Voice Conversion for ASR Augmentation

Paper Title

Non-Parallel Voice Conversion for ASR Augmentation

Authors

Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yinghui Huang, Jesse Emond, Pedro Moreno Mengibar

Abstract

Automatic speech recognition (ASR) needs to be robust to speaker differences. Voice Conversion (VC) modifies speaker characteristics of input speech. This is an attractive feature for ASR data augmentation. In this paper, we demonstrate that voice conversion can be used as a data augmentation technique to improve ASR performance, even on LibriSpeech, which contains 2,456 speakers. For ASR augmentation, it is necessary that the VC model be robust to a wide range of input speech. This motivates the use of a non-autoregressive, non-parallel VC model, and the use of a pretrained ASR encoder within the VC model. This work suggests that despite including many speakers, speaker diversity may remain a limitation to ASR quality. Finally, interrogation of our VC performance has provided useful metrics for objective evaluation of VC quality.
