Paper Title
A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition
Paper Authors
Paper Abstract
Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding (MPC) achieved significant improvements on various speech recognition datasets with a BERT-like masked reconstruction loss and a Transformer backbone. However, many aspects of MPC have not been fully investigated. In this paper, we conduct a further study on MPC and focus on three important aspects: the effect of the speaking style of pre-training data, its extension to streaming models, and how to better transfer learned knowledge from the pre-training stage to downstream tasks. Experiments reveal that pre-training data with a matching speaking style is more useful for downstream recognition tasks. A unified training objective with APC and MPC provides an 8.46% relative error reduction on a streaming model trained on HKUST. Also, the combination of target data adaptation and layer-wise discriminative training helps the knowledge transfer of MPC, achieving a 3.99% relative error reduction on AISHELL over a strong baseline.
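As a rough illustration of the MPC objective described in the abstract, the sketch below masks random frames of unlabeled acoustic features, reconstructs them with a Transformer encoder, and computes an L1 loss only on the masked positions. This is a minimal sketch under assumed settings: the class name `MPCPretrainer`, the masking probability, and the model sizes are illustrative and not the paper's actual implementation.

```python
# Minimal sketch of a BERT-like Masked Predictive Coding (MPC) objective:
# random frames of the input features are zeroed out, a Transformer encoder
# reconstructs them, and an L1 loss is taken over the masked frames only.
# Hyperparameters below are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn


class MPCPretrainer(nn.Module):
    def __init__(self, feature_dim=80, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.input_proj = nn.Linear(feature_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, feature_dim)

    def forward(self, features, mask_prob=0.15):
        # features: (batch, time, feature_dim) acoustic features, e.g. log-mel FBANK
        mask = torch.rand(features.shape[:2], device=features.device) < mask_prob
        masked_inputs = features.masked_fill(mask.unsqueeze(-1), 0.0)

        hidden = self.encoder(self.input_proj(masked_inputs))
        reconstruction = self.output_proj(hidden)

        # L1 reconstruction loss, computed only on the masked frames
        loss = (reconstruction - features).abs()[mask].mean()
        return loss


if __name__ == "__main__":
    model = MPCPretrainer()
    fbank = torch.randn(2, 100, 80)  # dummy batch of unlabeled speech features
    print(model(fbank).item())
```

After pre-training with such an objective, the encoder weights would typically be used to initialize a downstream speech recognition model before fine-tuning on transcribed data.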