Paper Title
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video
Paper Authors
Abstract
In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network that stores source (i.e., visual) and target (i.e., audio) modal representations, where the source modal representations are what we are given and the target modal representations are what we want to obtain from the memory network. We then construct an associative bridge between the source and target memories that captures the interrelationship between the two. By learning this interrelationship through the associative bridge, the proposed framework is able to obtain the target modal representations inside the memory network even with source modal input only, providing rich information for downstream tasks. We apply the proposed framework to two tasks: lip reading and speech reconstruction from silent video. Through the proposed associative bridge and modality-specific memories, the knowledge for each task is enriched with the recalled audio context, achieving state-of-the-art performance. We also verify that the associative bridge properly relates the source and target memories.
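The core idea of the associative bridge can be illustrated with a small sketch: address the source (visual) memory with a visual query, then reuse the resulting attention weights to read the target (audio) memory, recalling an audio representation from visual input alone. This is a minimal NumPy illustration of that retrieval pattern, not the paper's actual architecture; the slot count, feature dimension, and function names are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: N memory slots, d-dimensional features.
N, d = 8, 4
rng = np.random.default_rng(0)
visual_memory = rng.normal(size=(N, d))  # source-modality memory slots
audio_memory = rng.normal(size=(N, d))   # target-modality memory slots

def recall_audio(visual_query):
    """Address the visual memory with a visual query, then reuse the
    resulting slot weights to read the audio memory -- the 'associative
    bridge' between the two modality-specific memories."""
    scores = visual_memory @ visual_query   # similarity to each slot, shape (N,)
    weights = softmax(scores)               # soft address over slots, sums to 1
    return weights @ audio_memory           # recalled audio representation, shape (d,)

visual_query = rng.normal(size=d)
recalled = recall_audio(visual_query)
print(recalled.shape)
```

In training, both memories would be learned jointly so that corresponding slots align across modalities; at inference, only the visual branch is needed to recall the audio context.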