Paper Title

Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations

Authors

Renee Lu, Mostafa Shahin, Beena Ahmed

Abstract

Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies. The major challenge impeding progress in this domain is the lack of adequate child speech corpora; however, recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity. In this paper, we leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition. We assess the performance of fine-tuning on both native and non-native children's speech, examine the effect of cross-domain child corpora, and investigate the minimum amount of child speech required to fine-tune a model which outperforms a state-of-the-art adult model. We also analyze speech recognition performance across children's ages. Our results demonstrate that fine-tuning with cross-domain child corpora leads to relative improvements of up to 46.08% and 45.53% for native and non-native child speech respectively, and absolute improvements of 14.70% and 31.10%. We also show that with as little as 5 hours of transcribed children's speech, it is possible to fine-tune a children's speech recognition system that outperforms a state-of-the-art adult model fine-tuned on 960 hours of adult speech.
