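Investigating self-supervised, weakly supervised and fully supervised training approaches for multi-domain automatic speech recognition: a study on Bangladeshi Bangla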

Paper title

Investigating self-supervised, weakly supervised and fully supervised training approaches for multi-domain automatic speech recognition: a study on Bangladeshi Bangla

Authors

Samin, Ahnaf Mozib; Kobir, M. Humayon; Rafee, Md. Mushtaq Shahriyar; Ahmed, M. Firoz; Hasan, Mehedi; Ghosh, Partha; Kibria, Shafkat; Rahman, M. Shahidur

Abstract

Despite large improvements in automatic speech recognition (ASR) employing neural networks, ASR systems still lack robustness and generalizability under domain shift. This is mainly because principal corpus design criteria are often not identified and examined adequately when ASR datasets are compiled. In this study, we investigate the robustness of state-of-the-art transfer learning approaches, namely self-supervised wav2vec 2.0 and weakly supervised Whisper, as well as fully supervised convolutional neural networks (CNNs), for multi-domain ASR. We also demonstrate the significance of domain selection when building a corpus by assessing these models on BanSpeech, a novel multi-domain Bangladeshi Bangla ASR evaluation benchmark that contains approximately 6.52 hours of human-annotated speech and 8085 utterances from 13 distinct domains. SUBAK.KO, a mostly read speech corpus for the morphologically rich language Bangla, has been used to train the ASR systems. Experimental evaluation reveals that self-supervised cross-lingual pre-training is a better strategy than weak supervision or full supervision for tackling the multi-domain ASR task. Moreover, the ASR models trained on SUBAK.KO have difficulty recognizing speech from domains with mostly spontaneous speech. BanSpeech will be made publicly available to meet the need for a challenging evaluation benchmark for Bangla ASR.
