Paper Title
Unsupervised Cross-lingual Representation Learning for Speech Recognition
Paper Authors
Paper Abstract
This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to a comparable system. Our approach enables a single multilingual speech recognition model which is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages. We hope to catalyze research in low-resource speech understanding by releasing XLSR-53, a large model pretrained on 53 languages.
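The contrastive task over masked latents mentioned in the abstract can be sketched as an InfoNCE-style loss: at each masked time step, the model's context vector must identify the true quantized latent among sampled distractors. The sketch below is a minimal NumPy illustration under simplifying assumptions (the function name, shapes, cosine scoring, and uniform distractor sampling are illustrative choices, not the actual wav2vec 2.0 implementation):

```python
import numpy as np

def contrastive_loss(context, targets, num_distractors=100, temperature=0.1, rng=None):
    """InfoNCE-style contrastive loss over masked time steps (illustrative sketch).

    context: (T, D) array of model outputs at masked positions.
    targets: (T, D) array of quantized latents; targets[t] is the positive for step t.
    Distractors for step t are the quantized latents of other masked steps.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, _ = context.shape

    def cosine(a, b):
        # cosine similarity along the feature axis
        return (a * b).sum(-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
        )

    losses = []
    for t in range(T):
        # Sample distractor steps uniformly from the other masked positions.
        k = min(num_distractors, T - 1)
        cand = rng.choice([i for i in range(T) if i != t], size=k, replace=False)
        cands = np.concatenate([targets[t : t + 1], targets[cand]])  # positive first
        sims = cosine(np.broadcast_to(context[t], cands.shape), cands) / temperature
        # Cross-entropy with the positive at index 0 (numerically stable log-sum-exp).
        logz = np.log(np.exp(sims - sims.max()).sum()) + sims.max()
        losses.append(logz - sims[0])
    return float(np.mean(losses))
```

In the full model this objective is combined with a codebook diversity term, and because the quantizer's codebook is shared across all pretraining languages, related languages end up reusing the same discrete units, which is the sharing effect the analysis in the paper reports.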