Paper Title

Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

Paper Authors

Bethan Thomas, Samuel Kessler, Salah Karout

Paper Abstract

Self-supervised learning (SSL) is a powerful tool that allows learning of underlying representations from unlabeled data. Transformer based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain. Generally these models are fine-tuned on a small amount of labeled data for a downstream task such as Automatic Speech Recognition (ASR). This involves re-training the majority of the model for each task. Adapters are small lightweight modules which are commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. In this paper we propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks, and increase scalability of the model to multiple tasks or languages. Using adapters we can perform ASR while training fewer than 10% of parameters per task compared to full fine-tuning with little degradation of performance. Ablations show that applying adapters into just the top few layers of the pre-trained network gives similar performance to full transfer, supporting the theory that higher pre-trained layers encode more phonemic information, and further optimizing efficiency.
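
The following is a minimal PyTorch sketch of the kind of bottleneck adapter module the abstract refers to (the Houlsby-style design common in NLP): a down-projection, non-linearity, up-projection, and residual connection inserted into each frozen transformer layer. The hidden size of 768, bottleneck size of 64, and 12-layer count are illustrative assumptions for a BASE-sized wav2vec 2.0 model, not the paper's exact configuration.

import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: layer norm, down-projection, non-linearity,
    up-projection, and a residual connection. Only these parameters are
    trained for a new task; the pre-trained transformer weights stay frozen."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual add keeps the frozen pre-trained representation intact.
        return x + self.up(self.act(self.down(self.layer_norm(x))))


if __name__ == "__main__":
    # One adapter per transformer layer (12 layers assumed, BASE-sized model).
    adapters = nn.ModuleList(Adapter() for _ in range(12))
    trainable = sum(p.numel() for p in adapters.parameters())
    print(f"Trainable adapter parameters per task: {trainable:,}")  # roughly 1.2M

With these assumed sizes the adapters amount to roughly 1.2M trainable parameters per task, comfortably within the "fewer than 10% of parameters" figure quoted in the abstract, given that wav2vec 2.0 BASE has roughly 95M parameters.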
