Deep Speech Synthesis from Articulatory Representations

Paper Title

Deep Speech Synthesis from Articulatory Representations

Authors

Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

Abstract

In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract. This task provides a promising direction for speech synthesis research, as the articulatory space is compact, smooth, and interpretable. Current works have highlighted the potential for deep learning models to perform articulatory synthesis. However, it remains unclear whether these models can achieve the efficiency and fidelity of the human speech production system. To help bridge this gap, we propose a time-domain articulatory synthesis methodology and demonstrate its efficacy with both electromagnetic articulography (EMA) and synthetic articulatory feature inputs. Our model is computationally efficient and achieves a transcription word error rate (WER) of 18.5% for the EMA-to-speech task, yielding an improvement of 11.6% compared to prior work. Through interpolation experiments, we also highlight the generalizability and interpretability of our approach.
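For readers unfamiliar with the metric quoted above, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the abstract does not say which speech recognizer or text normalization was used, so the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In an EMA-to-speech evaluation of this kind, the hypothesis would typically come from running a speech recognizer on the synthesized audio, and the reference would be the ground-truth text. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.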
