Paper Title
Self-supervised Learning with Random-projection Quantizer for Speech Recognition
Paper Authors
Paper Abstract
We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly-initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separated from the speech recognition model, the design makes the approach flexible and is compatible with universal speech recognition architecture. On LibriSpeech our approach achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.
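The abstract describes the random-projection quantizer procedurally: project each speech frame with a frozen random matrix, then take the index of the nearest entry in a frozen random codebook. Below is a minimal NumPy sketch of that idea; the class name, dimensions, and the normalization detail are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

class RandomProjectionQuantizer:
    """Minimal sketch of a random-projection quantizer (illustrative, not the paper's code).

    Projects each speech feature frame with a fixed random matrix and returns
    the index of the nearest codebook entry. Neither the projection matrix nor
    the codebook is updated during self-supervised learning.
    """

    def __init__(self, input_dim, codebook_dim, codebook_size, seed=0):
        rng = np.random.default_rng(seed)
        # Random projection matrix and codebook, frozen after initialization.
        self.projection = rng.normal(size=(input_dim, codebook_dim))
        codebook = rng.normal(size=(codebook_size, codebook_dim))
        # Normalizing codebook entries is an assumption here; it makes the
        # l2 nearest-neighbor lookup equivalent to maximum cosine similarity.
        self.codebook = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)

    def __call__(self, features):
        """features: (num_frames, input_dim) -> (num_frames,) discrete labels."""
        projected = features @ self.projection
        projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)
        # Nearest-neighbor lookup in the frozen codebook.
        distances = np.linalg.norm(
            projected[:, None, :] - self.codebook[None, :, :], axis=-1)
        return distances.argmin(axis=1)

# Example usage: quantize 100 frames of 80-dimensional speech features into
# labels from a 16-entry codebook (all sizes here are hypothetical).
quantizer = RandomProjectionQuantizer(input_dim=80, codebook_dim=16, codebook_size=16)
labels = quantizer(np.random.randn(100, 80))
```

These discrete labels serve as prediction targets for the masked portions of the speech signal; because the quantizer is untrained and separate from the recognition model, the same targets can be used with any encoder architecture, streaming or non-streaming.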