Emotion Selectable End-to-End Text-based Speech Editing

Paper Title

Emotion Selectable End-to-End Text-based Speech Editing

Authors

Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen, Chu Yuan Zhang

Abstract

Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text, which speeds up the editing process. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper addresses a new task: adding emotional effects to the edited speech during text-based speech editing to make the generated speech more expressive. To achieve this, we propose Emo-CampNet (emotion CampNet), which provides a choice of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. First, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of the generated speech by introducing additional emotion attributes on top of the context-aware mask prediction network. Second, to prevent the emotional components of the original speech from interfering with the emotion of the generated speech, a neutral content generator is proposed to remove the emotion from the original speech; it is optimized within a generative adversarial framework. Third, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, enabling the model to edit unseen speakers' speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech during text-based speech editing and can edit unseen speakers' speech; 2) detailed ablation experiments further demonstrate the effectiveness of emotion selectivity and the data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
