Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Paper title

Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Authors

Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He

Abstract

Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model training on data from the target speaker, which is a costly process that may incorporate potentially sensitive data into server-side models. In contrast, this work focuses on the zero-shot approach which avoids finetuning altogether, and instead uses pretrained speaker verification embeddings together with a jointly trained reference encoder to encode utterance-level information that helps capture aspects such as speaker identity and prosody. Subjective listening tests find that both utterance embeddings and a reference encoder improve the continuity of speaker identity and prosody between the edited synthetic speech and unedited original recording in the zero-shot setting.
