Paper Title
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Paper Authors
Paper Abstract
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
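To make the pretraining objective concrete, below is a minimal, illustrative sketch of masked multimodal cluster prediction: both the audio and lip-video streams are masked, fused, and passed through a Transformer, and the model is trained to predict the discrete hidden-unit (cluster) label of each masked frame. All module names, dimensions, the zero-out masking, and the concatenation-based fusion are simplifying assumptions for illustration; they do not reproduce the actual AV-HuBERT architecture or its iterative target refinement.

```python
# Minimal sketch of masked multimodal cluster prediction (hypothetical, simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedClusterPredictor(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, hidden_dim=768,
                 num_clusters=500, mask_prob=0.3):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)    # audio-stream encoder (placeholder)
        self.video_proj = nn.Linear(video_dim, hidden_dim)    # lip-ROI stream encoder (placeholder)
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)   # fuse the two streams by concatenation
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.cluster_head = nn.Linear(hidden_dim, num_clusters)  # predict hidden-unit IDs
        self.mask_prob = mask_prob

    def forward(self, audio_feats, video_feats, cluster_targets):
        # audio_feats: (B, T, audio_dim); video_feats: (B, T, video_dim)
        # cluster_targets: (B, T) integer hidden-unit labels from offline clustering
        B, T, _ = audio_feats.shape
        mask = torch.rand(B, T, device=audio_feats.device) < self.mask_prob

        # Mask frames by zeroing them (a simplification; the paper masks spans
        # and the two streams can be masked independently).
        a = self.audio_proj(audio_feats.masked_fill(mask.unsqueeze(-1), 0.0))
        v = self.video_proj(video_feats.masked_fill(mask.unsqueeze(-1), 0.0))

        fused = self.fusion(torch.cat([a, v], dim=-1))
        hidden = self.backbone(fused)
        logits = self.cluster_head(hidden)

        # Cross-entropy only on masked frames: the model must infer the cluster
        # identity of hidden regions from the surrounding audio-visual context.
        loss = F.cross_entropy(logits[mask], cluster_targets[mask])
        return loss
```

In the framework described by the abstract, the cluster targets are discovered automatically and refined over successive pretraining iterations (early targets from clustering simple acoustic features, later targets from clustering the model's own representations); the sketch above assumes the labels are simply given.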