Paper Title
CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing
Authors
Abstract
Text-based speech editors allow speech to be edited through intuitive cut, copy, and paste operations, which speeds up the editing process. However, the major drawback of current systems is that the edited speech often sounds unnatural because of the cut-copy-paste operations. In addition, it is not obvious how to synthesize speech for a new word that does not appear in the transcript. This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet). First, the model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region from the surrounding speech context. It can correct unnatural prosody in the edited region and synthesize speech for words that do not appear in the transcript. Second, to cover the possible operations of text-based speech editing, we design three text-based operations based on CampNet: deletion, insertion, and replacement. These operations cover a wide range of speech editing situations. Third, to synthesize speech for long text in insertion and replacement operations, a word-level autoregressive generation method is proposed. Fourth, we propose a speaker adaptation method for CampNet that uses only one sentence, and we explore the few-shot learning ability of CampNet, which provides a new idea for speech forgery tasks. Subjective and objective experiments on the VCTK and LibriTTS datasets show that speech editing results based on CampNet are better than those of TTS technology, manual editing, and the VoCo method. We also conduct detailed ablation experiments to explore the effect of the CampNet structure on its performance. Finally, the experiments show that speaker adaptation with only one sentence can further improve the naturalness of the speech. Examples of generated speech can be found at https://hairuo55.github.io/CampNet.
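The masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it contains no neural network, and the names `MASK`, `mask_random_span`, and `apply_edit` are hypothetical. It assumes speech is a list of acoustic frames, that one contiguous span is masked during training, and that the three editing operations differ only in which frames are removed and how many masked frames are left for a model to fill in. How CampNet actually places and sizes the masked region is not stated in the abstract.

```python
import random

MASK = None  # hypothetical placeholder standing in for a masked acoustic frame


def mask_random_span(frames, mask_ratio, rng):
    """Return a copy of `frames` with one contiguous span replaced by MASK.

    Also returns the (start, end) indices of the span, end exclusive.
    A model would be trained to predict the original frames in that span
    from the unmasked context and the text.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("frames must not be empty")
    # Span length is an assumption: a fixed fraction of the utterance.
    span = min(n, max(1, int(n * mask_ratio)))
    start = rng.randrange(0, n - span + 1)
    masked = list(frames)
    for i in range(start, start + span):
        masked[i] = MASK
    return masked, (start, start + span)


def apply_edit(frames, op, start, end=None, new_len=0):
    """Build the frame sequence a mask-prediction model would receive.

    deletion:    drop frames[start:end]
    insertion:   place `new_len` masked frames at position `start`
    replacement: swap frames[start:end] for `new_len` masked frames
    """
    if op == "deletion":
        return frames[:start] + frames[end:]
    if op == "insertion":
        return frames[:start] + [MASK] * new_len + frames[start:]
    if op == "replacement":
        return frames[:start] + [MASK] * new_len + frames[end:]
    raise ValueError(f"unknown operation: {op}")


if __name__ == "__main__":
    utterance = list(range(10))  # integers stand in for acoustic frames
    masked, span = mask_random_span(utterance, 0.5, random.Random(0))
    print(masked, span)
    print(apply_edit(utterance, "replacement", 2, 5, new_len=4))
```

In this sketch, training and editing share the same input format: a frame sequence in which some positions are masked. That is one way to read the abstract's claim that random masking simulates the editing process.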