Paper title
Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech
Paper authors
Paper abstract
This work adapts two recent architectures of generative models and evaluates their effectiveness for the conversion of whispered speech to normal speech. We incorporate the normal target speech into the training criterion of vector-quantized variational autoencoders (VQ-VAEs) and MelGANs, thereby conditioning the systems to recover voiced speech from whispered inputs. Objective and subjective quality measures indicate that both VQ-VAEs and MelGANs can be modified to perform the conversion task. We find that the proposed approaches significantly improve the Mel cepstral distortion (MCD) metric by at least 25% relative to a DiscoGAN baseline. Subjective listening tests suggest that the MelGAN-based system significantly improves naturalness, intelligibility, and voicing compared to the whispered input speech. A novel evaluation measure based on differences between latent speech representations also indicates that our MelGAN-based approach yields improvements relative to the baseline.
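For readers unfamiliar with the MCD metric cited above, the following is a minimal sketch of one common formulation, not necessarily the exact configuration used in the paper. It assumes the reference and converted utterances have already been time-aligned frame by frame (for example with dynamic time warping) and that the 0th cepstral coefficient, which mostly carries energy, is excluded. The function names `mel_cepstral_distortion` and `relative_improvement` are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref_frames: Sequence[Sequence[float]],
    conv_frames: Sequence[Sequence[float]],
) -> float:
    """Average MCD in dB over time-aligned mel-cepstral frames.

    Assumption: both sequences have the same number of frames and the same
    number of coefficients per frame; coefficient 0 is skipped.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("frame sequences must be non-empty and equally long")

    # Constant factor of the usual definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same dimensionality")
        # Squared Euclidean distance over coefficients 1..D.
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


def relative_improvement(baseline_mcd: float, proposed_mcd: float) -> float:
    """Fractional MCD reduction relative to a baseline (lower MCD is better)."""
    return (baseline_mcd - proposed_mcd) / baseline_mcd
```

Under this definition, an improvement of "at least 25% relative to a baseline" would correspond to `relative_improvement(baseline_mcd, proposed_mcd) >= 0.25`, where both arguments are MCD values measured against the same normal-speech targets.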