Paper Title

Text-To-Speech Data Augmentation for Low Resource Speech Recognition

Paper Authors

Zevallos, Rodolfo

Paper Abstract

Nowadays, the main problem of deep learning techniques used in the development of automatic speech recognition (ASR) models is the lack of transcribed data. The goal of this research is to propose a new data augmentation method to improve ASR models for agglutinative and low-resource languages. This novel data augmentation method generates both synthetic text and synthetic audio. Some experiments were conducted using the corpus of the Quechua language, which is an agglutinative and low-resource language. In this study, a sequence-to-sequence (seq2seq) model was applied to generate synthetic text, in addition to generating synthetic speech using a text-to-speech (TTS) model for Quechua. The results show that the new data augmentation method works well to improve the ASR model for Quechua. In this research, an 8.73% improvement in the word-error-rate (WER) of the ASR model is obtained using a combination of synthetic text and synthetic speech.
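To make the described pipeline more concrete, below is a minimal Python sketch of how synthetic text and synthetic speech could be combined with real transcribed data for ASR training. The helpers `generate_synthetic_text`, `synthesize_speech`, and the `Utterance` record, as well as the sample Quechua sentence, are hypothetical placeholders for illustration only; the paper does not publish this code or these interfaces.

```python
# A minimal sketch of the augmentation pipeline described in the abstract.
# The seq2seq and TTS calls below are hypothetical stand-ins, not the
# authors' actual models or APIs.

from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    text: str        # transcript
    audio_path: str  # path to the corresponding waveform


def generate_synthetic_text(seed_sentences: List[str], n: int) -> List[str]:
    """Hypothetical seq2seq text generator: produce n new sentences
    conditioned on the seed corpus. Replaced here by a trivial placeholder."""
    return [seed_sentences[i % len(seed_sentences)] for i in range(n)]


def synthesize_speech(sentence: str, out_path: str) -> str:
    """Hypothetical TTS call: render `sentence` to a waveform at `out_path`.
    A real system would invoke the Quechua TTS model here."""
    return out_path


def build_augmented_corpus(real: List[Utterance], n_synthetic: int) -> List[Utterance]:
    """Combine real transcribed data with synthetic (text, audio) pairs."""
    seed = [u.text for u in real]
    synthetic = []
    for i, sent in enumerate(generate_synthetic_text(seed, n_synthetic)):
        wav = synthesize_speech(sent, f"synthetic_{i:06d}.wav")
        synthetic.append(Utterance(text=sent, audio_path=wav))
    # The ASR model is then trained on the union of real and synthetic data.
    return real + synthetic


if __name__ == "__main__":
    real_corpus = [Utterance("allin p'unchay", "real_000.wav")]  # placeholder sample
    augmented = build_augmented_corpus(real_corpus, n_synthetic=3)
    print(f"{len(augmented)} utterances in the augmented training set")
```

Under these assumptions, the WER gain reported in the abstract would come from training the ASR model on the augmented set returned by `build_augmented_corpus` instead of the real corpus alone.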
