Paper Title

Improving the Robustness of DistilHuBERT to Unseen Noisy Conditions via Data Augmentation, Curriculum Learning, and Multi-Task Enhancement

Paper Authors

Heitor R. Guimarães, Arthur Pimentel, Anderson R. Avila, Mehdi Rezagholizadeh, Tiago H. Falk

Paper Abstract

Self-supervised speech representation learning aims to extract meaningful factors from the speech signal that can later be used across different downstream tasks, such as speech and/or emotion recognition. Existing models, such as HuBERT, however, can be fairly large and thus may not be suitable for edge speech applications. Moreover, realistic applications typically involve speech corrupted by noise and room reverberation, hence models need to provide representations that are robust to such environmental factors. In this study, we build on the so-called DistilHuBERT model, which distils HuBERT to a fraction of its original size, with three modifications, namely: (i) augment the training data with noise and reverberation, while the student model distils the clean representations from the teacher model; (ii) introduce a curriculum learning approach where increasing levels of noise are introduced as the model trains, thus helping with convergence and with the creation of more robust representations; and (iii) introduce a multi-task learning approach where the model also reconstructs the clean waveform jointly with the distillation task, thus acting as an enhancement step that ensures additional environmental robustness of the representations. Experiments on three SUPERB tasks show the advantages of the proposed method not only relative to the original DistilHuBERT, but also to the original HuBERT, thus demonstrating the suitability of the proposed method for "in the wild" edge speech applications.
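
To make the three modifications concrete, below is a minimal PyTorch sketch of one training step combining them: curriculum noise mixing with a linearly decreasing SNR, distillation of clean teacher representations from noisy student input, and joint clean-waveform reconstruction. This is a sketch under stated assumptions, not the paper's implementation: the linear 20→0 dB SNR schedule, the plain L1 losses, the `decoder` head, and the `alpha` loss weight are all illustrative placeholders (DistilHuBERT's actual prediction-head losses and the paper's training details may differ).

```python
import torch
import torch.nn.functional as F


def curriculum_snr(step: int, total_steps: int,
                   snr_start: float = 20.0, snr_end: float = 0.0) -> float:
    """Linearly anneal the mixing SNR (dB) from easy (high SNR) to hard
    (low SNR) as training progresses. The linear schedule and the 20->0 dB
    endpoints are illustrative assumptions, not values from the paper."""
    frac = min(step / total_steps, 1.0)
    return snr_start + frac * (snr_end - snr_start)


def add_noise(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a noise waveform into the clean waveform at the requested SNR."""
    clean_power = clean.pow(2).mean(dim=-1, keepdim=True)
    noise_power = noise.pow(2).mean(dim=-1, keepdim=True).clamp_min(1e-8)
    scale = torch.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


def training_step(student, teacher, decoder, clean_wav, noise_wav,
                  step, total_steps, alpha=0.5):
    """One robust-distillation step: the student encodes *noisy* audio but is
    trained to match the frozen teacher's *clean* layer representations, while
    a decoder head jointly reconstructs the clean waveform (the multi-task
    enhancement objective). `student`/`teacher` are assumed to return a list
    of layer representations; `decoder` and `alpha` are placeholders."""
    noisy_wav = add_noise(clean_wav, noise_wav, curriculum_snr(step, total_steps))

    with torch.no_grad():
        targets = teacher(clean_wav)        # clean targets from the frozen teacher
    preds = student(noisy_wav)              # the student only ever sees noisy input

    distill_loss = sum(F.l1_loss(p, t) for p, t in zip(preds, targets))
    recon_loss = F.l1_loss(decoder(preds[-1]), clean_wav)
    return distill_loss + alpha * recon_loss


if __name__ == "__main__":
    # Toy smoke test with identity stand-ins for the networks.
    torch.manual_seed(0)
    clean = torch.randn(2, 16000)
    noise = torch.randn(2, 16000)
    as_layers = lambda wav: [wav]           # "encoder" returning one layer
    loss = training_step(as_layers, as_layers, lambda r: r,
                         clean, noise, step=100, total_steps=1000)
    print(f"toy loss: {loss.item():.4f}")
```

The key design point the sketch captures is the asymmetry between the two branches: augmentation is applied only on the student's input while the distillation targets always come from the teacher run on clean speech, so the student is pushed toward noise-invariant representations rather than merely noise-tolerant ones.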
