Paper Title

Seeing wake words: Audio-visual Keyword Spotting

Authors

Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

Abstract

The goal of this work is to automatically determine whether, and when, a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for in-the-wild videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity-map intermediate representation to separate the task into (i) sequence matching and (ii) pattern detection, deciding whether the word is present and when; (2) a demonstration that, if audio is available, visual keyword spotting improves performance for both clean and noisy audio signals; and (3) we show that our method generalises to other languages, specifically French and German, achieving performance comparable to English with less language-specific data by fine-tuning the network pre-trained on English. The method exceeds the performance of the previous state-of-the-art visual keyword spotting architecture when trained and tested on the same benchmark, and also that of a state-of-the-art lip reading method.
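To make the similarity-map idea in contribution (1) concrete, below is a minimal sketch, not the authors' released code: the keyword is embedded as a grapheme sequence, the video as a sequence of per-frame visual features, a keyword-by-time cosine-similarity map is computed, and a small 2-D CNN detects whether and when a matching pattern appears. All module names, dimensions and hyper-parameters (e.g. `SimilarityMapKWS`, `num_graphemes`, the BiGRU keyword encoder) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a similarity-map keyword spotter in the spirit of KWS-Net.
# Assumptions: grapheme-level keyword encoding, 512-d per-frame visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityMapKWS(nn.Module):
    def __init__(self, num_graphemes=40, vis_dim=512, emb_dim=256):
        super().__init__()
        # Keyword branch: grapheme embedding + BiGRU -> one vector per grapheme.
        self.grapheme_emb = nn.Embedding(num_graphemes, emb_dim)
        self.keyword_rnn = nn.GRU(emb_dim, emb_dim // 2, batch_first=True,
                                  bidirectional=True)
        # Visual branch: project per-frame lip features into the same space.
        self.vis_proj = nn.Linear(vis_dim, emb_dim)
        # Pattern detector: a small 2-D CNN over the (grapheme x time) map.
        self.detector = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, keyword_ids, vis_feats):
        # keyword_ids: (B, K) grapheme indices; vis_feats: (B, T, vis_dim).
        kw, _ = self.keyword_rnn(self.grapheme_emb(keyword_ids))   # (B, K, emb)
        vis = self.vis_proj(vis_feats)                             # (B, T, emb)
        # Sequence matching: cosine-similarity map, one row per grapheme,
        # one column per video frame.
        sim = torch.einsum('bke,bte->bkt',
                           F.normalize(kw, dim=-1),
                           F.normalize(vis, dim=-1))               # (B, K, T)
        # Pattern detection: the CNN scores the map; pooling over graphemes
        # gives a per-frame score whose max answers "is the word there?"
        # and whose argmax answers "when?".
        heat = self.detector(sim.unsqueeze(1)).squeeze(1)          # (B, K, T)
        return heat.mean(dim=1)                                    # (B, T)


if __name__ == "__main__":
    model = SimilarityMapKWS()
    keyword = torch.randint(0, 40, (1, 7))    # a 7-grapheme query word
    video = torch.randn(1, 100, 512)          # 100 frames of visual features
    scores = model(keyword, video)            # (1, 100) per-frame scores
    print(scores.shape, scores.argmax(dim=1))
```

The split mirrors the abstract's description: the similarity map carries the sequence matching between keyword and video, while the convolutional detector performs pattern detection over that map to decide presence and localisation.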
