CAMP: a Two-Stage Approach to Modelling Prosody in Context

Paper Title

CAMP: a Two-Stage Approach to Modelling Prosody in Context

Authors

Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman

Abstract

Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.
