Paper Title

Seeing wake words: Audio-visual Keyword Spotting

Authors

Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

Abstract

The goal of this work is to automatically determine whether, and when, a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for in-the-wild videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity-map intermediate representation to separate the task into (i) sequence matching and (ii) pattern detection, deciding whether the word is present and when; (2) a demonstration that, if audio is available, visual keyword spotting improves performance for both clean and noisy audio signals; and (3) we show that our method generalises to other languages, specifically French and German, achieving performance comparable to English with less language-specific data by fine-tuning the network pre-trained on English. The method exceeds the performance of the previous state-of-the-art visual keyword spotting architecture when trained and tested on the same benchmark, and also that of a state-of-the-art lip reading method.
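To make the similarity-map idea in contribution (1) concrete, below is a minimal sketch, not the authors' released code: the keyword is embedded as a grapheme sequence, the video as a sequence of per-frame visual features, a keyword-by-time cosine-similarity map is computed, and a small 2-D CNN detects whether and when a matching pattern appears. All module names, dimensions and hyper-parameters (e.g. `SimilarityMapKWS`, `num_graphemes`, the BiGRU keyword encoder) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a similarity-map keyword spotter in the spirit of KWS-Net.
# Assumptions: grapheme-level keyword encoding, 512-d per-frame visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityMapKWS(nn.Module):
    def __init__(self, num_graphemes=40, vis_dim=512, emb_dim=256):
        super().__init__()
        # Keyword branch: grapheme embedding + BiGRU -> one vector per grapheme.
        self.grapheme_emb = nn.Embedding(num_graphemes, emb_dim)
        self.keyword_rnn = nn.GRU(emb_dim, emb_dim // 2, batch_first=True,
                                  bidirectional=True)
        # Visual branch: project per-frame lip features into the same space.
        self.vis_proj = nn.Linear(vis_dim, emb_dim)
        # Pattern detector: a small 2-D CNN over the (grapheme x time) map.
        self.detector = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, keyword_ids, vis_feats):
        # keyword_ids: (B, K) grapheme indices; vis_feats: (B, T, vis_dim).
        kw, _ = self.keyword_rnn(self.grapheme_emb(keyword_ids))   # (B, K, emb)
        vis = self.vis_proj(vis_feats)                             # (B, T, emb)
        # Sequence matching: cosine-similarity map, one row per grapheme,
        # one column per video frame.
        sim = torch.einsum('bke,bte->bkt',
                           F.normalize(kw, dim=-1),
                           F.normalize(vis, dim=-1))               # (B, K, T)
        # Pattern detection: the CNN scores the map; pooling over graphemes
        # gives a per-frame score whose max answers "is the word there?"
        # and whose argmax answers "when?".
        heat = self.detector(sim.unsqueeze(1)).squeeze(1)          # (B, K, T)
        return heat.mean(dim=1)                                    # (B, T)


if __name__ == "__main__":
    model = SimilarityMapKWS()
    keyword = torch.randint(0, 40, (1, 7))    # a 7-grapheme query word
    video = torch.randn(1, 100, 512)          # 100 frames of visual features
    scores = model(keyword, video)            # (1, 100) per-frame scores
    print(scores.shape, scores.argmax(dim=1))
```

The split mirrors the abstract's description: the similarity map carries the sequence matching between keyword and video, while the convolutional detector performs pattern detection over that map to decide presence and localisation.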
