Paper Title

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

Paper Authors

Artem Gorodetskii, Ivan Ozhiganov

Paper Abstract

With recent advancements in voice cloning, the performance of speech synthesis for a target speaker has been rendered similar to the human level. However, autoregressive voice cloning systems still suffer from text alignment failures, resulting in an inability to synthesize long sentences. In this work, we propose a variant of an attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech and generalize to very long utterances as well. The proposed system is based on three independently trained components: a speaker encoder, a synthesizer and a universal vocoder. Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention, in combination with a set of modifications proposed for the synthesizer based on Tacotron 2. Moreover, effective zero-shot speaker adaptation is achieved by conditioning both the synthesizer and vocoder on a speaker encoder that has been pretrained on a large corpus of diverse data. We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances, and conclude that the proposed model can produce intelligible synthetic speech for extremely long utterances, while preserving a high extent of naturalness and similarity for short texts.
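The key ingredient named in the abstract is the location-relative attention mechanism. As a concrete illustration only (not the authors' exact implementation), below is a minimal PyTorch sketch of one Dynamic Convolution Attention step, in which the next alignment is computed from the previous one via static filters, query-predicted dynamic filters and a fixed causal prior; the class name, layer sizes, filter counts and beta-binomial prior parameters are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.stats import betabinom


class DynamicConvolutionAttention(nn.Module):
    """Sketch of an energy-based, location-relative attention step.

    The next alignment depends only on the previous alignment (no
    content-based term), which is what allows generalization to very
    long utterances. Hyperparameters below are illustrative.
    """

    def __init__(self, query_dim=1024, attn_dim=128,
                 static_channels=8, static_kernel=21,
                 dynamic_channels=8, dynamic_kernel=21,
                 prior_length=11, alpha=0.1, beta=0.9):
        super().__init__()
        self.dynamic_channels = dynamic_channels
        self.dynamic_kernel = dynamic_kernel

        # Static location filters applied to the previous alignment.
        self.static_conv = nn.Conv1d(1, static_channels, static_kernel,
                                     padding=(static_kernel - 1) // 2, bias=False)
        # MLP that predicts dynamic filter taps from the attention-RNN state.
        self.dynamic_fc = nn.Sequential(
            nn.Linear(query_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, dynamic_channels * dynamic_kernel))
        # Projections of the filtered alignments into a shared energy space.
        self.W_static = nn.Linear(static_channels, attn_dim)
        self.W_dynamic = nn.Linear(dynamic_channels, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

        # Fixed causal prior filter (beta-binomial) that biases the alignment
        # to advance a small number of encoder steps per decoder step.
        taps = betabinom.pmf(range(prior_length), prior_length - 1, alpha, beta)
        prior = torch.tensor(taps, dtype=torch.float32).flip(0)
        self.register_buffer("prior", prior.view(1, 1, -1))

    def forward(self, query, prev_align):
        """query: (B, query_dim); prev_align: (B, T) -> next alignment (B, T)."""
        B, T = prev_align.shape
        align = prev_align.unsqueeze(1)                            # (B, 1, T)

        # Static location features.
        f = self.static_conv(align).transpose(1, 2)                # (B, T, C_s)

        # Dynamic filters, one set per batch element (grouped-conv trick).
        filters = self.dynamic_fc(query).view(-1, 1, self.dynamic_kernel)
        g = F.conv1d(align.view(1, B, T), filters,
                     padding=(self.dynamic_kernel - 1) // 2, groups=B)
        g = g.view(B, self.dynamic_channels, T).transpose(1, 2)    # (B, T, C_d)

        # Causal prior term in the log domain; keep only the first T outputs.
        p = F.conv1d(align, self.prior, padding=self.prior.shape[-1] - 1)
        p = torch.log(p[..., :T].squeeze(1).clamp_min(1e-6))       # (B, T)

        energies = self.v(torch.tanh(self.W_static(f) + self.W_dynamic(g)))
        return torch.softmax(energies.squeeze(-1) + p, dim=-1)
```

In such a sketch, the synthesizer's decoder would call this module at every step with the attention-RNN state and the previous alignment (initialized as a one-hot vector on the first encoder position); since no content-based term enters the energies, the mechanism behaves the same way regardless of how long the input text is.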
