Paper Title

Discriminative Multi-modality Speech Recognition

Authors

Xu, Bo; Lu, Cheng; Guo, Yandong; Wang, Jacob

Abstract

Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio-only modality deteriorates significantly. Combining the visual modality with audio upgrades ASR to multi-modality speech recognition (MSR). In this paper, we propose a two-stage speech recognition model. In the first stage, the target voice is separated from background noise with the help of the corresponding visual information of lip movements, so that the model can "listen" clearly. In the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate. There are several other key contributions: we introduce a pseudo-3D residual convolution (P3D)-based visual front-end to extract more discriminative features; we replace the 1D ResNet temporal convolution block with a temporal convolutional network (TCN), which is better suited to temporal tasks; and we build the MSR sub-network on top of the Element-wise-Attention Gated Recurrent Unit (EleAtt-GRU), which is more effective than the Transformer on long sequences. We conducted extensive experiments on the LRS3-TED and LRW datasets. Our two-stage model (audio-enhanced multi-modality speech recognition, AE-MSR) consistently achieves state-of-the-art performance by a significant margin, which demonstrates the necessity and effectiveness of AE-MSR.
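
As a rough illustration of two building blocks named in the abstract, the sketch below does some back-of-the-envelope arithmetic. It is not the authors' implementation. It assumes the common serial pseudo-3D factorization (a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution), bias-free layers, and a TCN simplified to one dilated convolution per level. The channel counts, kernel size, dilation schedule, and all function names are hypothetical values chosen for the example.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a bias-free 3D convolution with a kt x kh x kw kernel."""
    return c_in * c_out * kt * kh * kw


def p3d_params(c_in, c_out, k=3):
    """Weight count of a serial pseudo-3D pair (assumed layout):
    a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one."""
    spatial = conv3d_params(c_in, c_out, 1, k, k)
    temporal = conv3d_params(c_out, c_out, k, 1, 1)
    return spatial + temporal


def tcn_receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated 1D convolutions,
    assuming one convolution per level: each level adds (k - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical sizes, chosen only for illustration.
full = conv3d_params(64, 64, 3, 3, 3)          # full 3 x 3 x 3 convolution
factored = p3d_params(64, 64)                  # pseudo-3D replacement
rf = tcn_receptive_field(3, [1, 2, 4, 8])      # dilation doubling per level

print(full, factored, rf)  # 110592 49152 31
```

Under these assumptions the factorized pair needs 2.25 times fewer weights than the full 3D kernel, and four dilated levels already cover 31 time steps, which gives some intuition for why such blocks are attractive for lip-movement video and long audio sequences.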
