Paper Title
Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis
Authors
Abstract
We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We, therefore, focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbre and speech disfluency. Specifically, we deal with filled pauses, a major source of speech disfluency, which is known to play an important role in speech generation and communication in psychology and linguistics. To comparatively evaluate personalized filled pause insertion and non-personalized filled pause prediction methods, we developed a speech synthesis method with a non-personalized external filled pause predictor trained with a multi-speaker corpus. The results clarify the position-word entanglement of filled pauses, i.e., the necessity of precisely predicting positions for naturalness and the necessity of precisely predicting words for individuality on the evaluation of synthesized speech.