Paper Title
Curriculum Audiovisual Learning
Paper Authors
Paper Abstract
Associating sounds with their producers in complex audiovisual scenes is a challenging task, especially when annotated training data are lacking. In this paper, we present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as the latent supervision for inferring the correlation among detected contents. To ease the difficulty of audiovisual learning, we propose a novel curriculum learning strategy that trains the model from simple to complex scenes. We show that such an ordered learning procedure rewards the model with the merits of easy training and fast convergence. Meanwhile, our audiovisual model also provides effective unimodal representations and cross-modal alignment performance. We further deploy the well-trained model in practical audiovisual sound localization and separation tasks. We show that our localization model significantly outperforms existing methods, and that, building on it, we achieve comparable performance in sound separation without referring to external visual supervision. Our video demo can be found at https://youtu.be/kuClfGG0cFU.
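To make the two central ideas of the abstract concrete, below is a minimal PyTorch sketch of (a) a soft-clustering head that softly assigns unimodal features to learnable content centers, and (b) a curriculum ordering that presents scenes from simple to complex, approximated here by the number of sounding sources per clip. All shapes, names (`SoftClusterDetector`, `curriculum_order`), and hyper-parameters are illustrative assumptions rather than the authors' released implementation, and the concurrency-based alignment loss is omitted for brevity.

```python
# Hypothetical sketch of the abstract's components; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftClusterDetector(nn.Module):
    """Softly assign D-dim features to K learnable content centers."""

    def __init__(self, dim: int, num_centers: int, temperature: float = 1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim))
        self.temperature = temperature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n, dim) unimodal features, e.g. audio frames
        # or visual regions; similarity to every center -> (batch, n, K)
        logits = feats @ self.centers.t() / self.temperature
        # softmax over centers yields the soft cluster assignment
        return F.softmax(logits, dim=-1)


def curriculum_order(num_sources: torch.Tensor) -> torch.Tensor:
    """Indices of clips sorted from simple (few sources) to complex scenes."""
    return torch.argsort(num_sources)


if __name__ == "__main__":
    detector = SoftClusterDetector(dim=128, num_centers=10)
    audio_feats = torch.randn(4, 16, 128)    # 4 clips, 16 audio frames each
    assignments = detector(audio_feats)      # (4, 16, 10); rows sum to 1
    print(assignments.sum(-1))               # sanity check: all ones

    # train on single-source clips first, multi-source clips later
    sources_per_clip = torch.tensor([3, 1, 2, 1])
    print(curriculum_order(sources_per_clip))  # 1-source clips come first
```

In a full pipeline, the soft assignments from both modalities would be aligned under the concurrency assumption (co-occurring audio and visual clusters should correlate), and the curriculum ordering would schedule which clips enter each training stage.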