Paper Title
AVATAR: Unconstrained Audiovisual Speech Recognition
Paper Authors
Paper Abstract
Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background, etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all other prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
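The word-masking idea in the abstract can be pictured with a small sketch: during training, the spectrogram frames corresponding to randomly selected words are zeroed out, so those words can only be recovered from the visual stream. The code below is a minimal illustration of this general idea, not the paper's implementation; the function name, the frame-aligned `word_spans` input, and the masking rate are all hypothetical.

```python
import numpy as np

def mask_words_in_spectrogram(spec, word_spans, mask_prob=0.15, rng=None):
    """Zero out the spectrogram frames of randomly chosen words.

    spec:       (num_frames, num_mel_bins) log-mel spectrogram
    word_spans: list of (start_frame, end_frame) pairs, one per word,
                e.g. obtained from a forced alignment (hypothetical
                input; AVATAR's exact masking scheme may differ)
    mask_prob:  fraction of words to mask (illustrative value)
    """
    rng = rng or np.random.default_rng()
    masked = spec.copy()
    for start, end in word_spans:
        if rng.random() < mask_prob:
            masked[start:end] = 0.0  # silence this word's audio frames
    return masked

# Usage: a 200-frame, 80-bin spectrogram with three aligned words.
spec = np.random.randn(200, 80)
spans = [(10, 40), (55, 90), (120, 170)]
masked = mask_words_in_spectrogram(spec, spans, mask_prob=0.5)
```

Under this kind of masking, an audio-only model incurs irreducible loss on the masked words, so gradient descent is pushed to exploit the visual tokens, which is the stated motivation for the strategy.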