Paper Title

Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Authors

Jiadong Wang, Xinyuan Qian, Haizhou Li

Abstract

Audio and visual signals complement each other in human speech perception, as they do in speech recognition. As far as speech perception is concerned, the visual cue is less evident than the acoustic cue, but it is more robust in a complex acoustic environment. How to effectively exploit the interaction between audio and visual signals for automatic speech recognition remains a challenge. Prior studies exploit visual signals as information redundant with, or complementary to, the audio input in a synchronous manner. Human studies suggest that the visual signal primes the listener in advance as to when and on which frequency to attend. We propose a Predict-and-Update Network (P&U net) to simulate such a visual cueing mechanism for Audio-Visual Speech Recognition (AVSR). In particular, we first predict the character posteriors of the spoken words, i.e. the visual embedding, based on the visual signals. The audio signal is then conditioned on the visual embedding via a novel cross-modal Conformer, which updates the character posteriors. We validate the effectiveness of the visual cueing mechanism through extensive experiments. The proposed P&U net outperforms state-of-the-art AVSR methods on both the LRS2-BBC and LRS3-BBC datasets, with relative Word Error Rate (WER) reductions exceeding 10% and 40% under clean and noisy conditions, respectively.
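The abstract describes a two-stage flow: first predict character posteriors from the lip video (the visual embedding), then let the audio stream update those posteriors. Below is a minimal PyTorch sketch of that flow. All module names, feature dimensions (96x96 lip crops, 80-dim filterbanks), and the use of plain cross-attention in place of the paper's cross-modal Conformer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PredictAndUpdateSketch(nn.Module):
    """Sketch of the predict-and-update flow from the abstract.
    Module granularity, dimensions, and names are assumptions."""

    def __init__(self, num_chars: int, dim: int = 256):
        super().__init__()
        # Predict stage: a visual front-end maps lip frames to
        # character posteriors (the "visual embedding").
        self.visual_frontend = nn.Sequential(
            nn.Linear(96 * 96, dim), nn.ReLU(), nn.Linear(dim, num_chars)
        )
        # Update stage: audio features attend to the visual cue.
        # Cross-attention stands in for the paper's cross-modal Conformer.
        self.audio_frontend = nn.Linear(80, dim)
        self.embed_posteriors = nn.Linear(num_chars, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.update_head = nn.Linear(dim, num_chars)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (B, Tv, 96*96) flattened lip frames
        # audio: (B, Ta, 80) filterbank features
        visual_posteriors = self.visual_frontend(video).log_softmax(-1)  # predict
        cue = self.embed_posteriors(visual_posteriors.exp())  # embed the visual cue
        a = self.audio_frontend(audio)
        fused, _ = self.cross_attn(a, cue, cue)  # condition audio on the cue
        updated_posteriors = self.update_head(a + fused).log_softmax(-1)  # update
        return visual_posteriors, updated_posteriors

# Usage: both posterior sequences are over the character vocabulary,
# so one might supervise both (e.g. with CTC), though the paper's
# training objective is not given in the abstract.
model = PredictAndUpdateSketch(num_chars=40)
v, a = torch.randn(2, 50, 96 * 96), torch.randn(2, 200, 80)
visual_post, updated_post = model(v, a)
```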
