Paper Title


Synthesizing Dysarthric Speech Using Multi-talker TTS for Dysarthric Speech Recognition

Authors

Mohammad Soleymanpour, Michael T. Johnson, Rahim Soleymanpour, Jeffrey Berry

Abstract


Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility due to slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. Robust dysarthria-specific ASR requires sufficient training speech, which is not readily available. Recent advances in Text-To-Speech (TTS) synthesis, particularly multi-speaker end-to-end TTS systems, suggest the possibility of using synthesis for data augmentation. In this paper, we aim to improve multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR. In the synthesized speech, we add dysarthria severity level and pause insertion mechanisms to other control parameters such as pitch, energy, and duration. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by a further 6.5%, showing the effectiveness of these parameters. Audio samples are available at
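The pause insertion mechanism described above can be illustrated with a minimal sketch. The paper does not specify its exact algorithm; the function below is a hypothetical stand-in that inserts pause tokens between words, with an insertion probability scaled by a severity value in [0, 1], to mimic the slower, more interrupted speech of higher dysarthria severity levels.

```python
import random

# Hypothetical pause token; the paper's actual token/representation is not specified.
PAUSE = "<pause>"

def insert_pauses(words, severity, rng=None):
    """Sketch of severity-conditioned pause insertion for TTS data augmentation.

    Inserts a pause token between adjacent words with a probability that
    grows with `severity` (0 = mild, 1 = severe). The base rate (0.2) and
    slope (0.6) are illustrative assumptions, not values from the paper.
    """
    rng = rng or random.Random(0)
    out = [words[0]]
    for w in words[1:]:
        if rng.random() < 0.2 + 0.6 * severity:
            out.append(PAUSE)
        out.append(w)
    return out
```

In a full pipeline, the resulting token sequence (words plus pause markers) would be fed to the multi-speaker TTS front end alongside the other control parameters (pitch, energy, duration) to synthesize augmented training utterances.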
