Paper Title
Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation
Paper Authors
Paper Abstract
We present a simple yet effective self-supervised framework for audio-visual representation learning, to localise the sound source in videos. To understand what enables the learning of useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) the composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations~({\em transformation invariance}); (2) enforcing geometric consistency substantially improves the quality of the learned representations, i.e. the detected sound source should follow the same transformation applied to the input video frames~({\em transformation equivariance}). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localisation benchmarks, namely Flickr-SoundNet and VGG-Sound. Additionally, we evaluate on audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performance, even competitive with the supervised approach on audio retrieval. This reveals that the proposed framework learns strong multi-modal representations that benefit sound localisation and generalise to further applications. \textit{All code will be made available}.
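The two principles above translate naturally into training objectives. The following is a minimal PyTorch sketch of that structure; the abstract does not give the exact formulation, so the names localisation_map and info_nce, the horizontal flip, and the MSE consistency term are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def localisation_map(frame_feat, audio_emb):
    """Cosine similarity between a global audio embedding and every
    spatial location of the visual feature map -> (B, H, W) map."""
    frame_feat = F.normalize(frame_feat, dim=1)  # (B, C, H, W)
    audio_emb = F.normalize(audio_emb, dim=1)    # (B, C)
    return torch.einsum('bchw,bc->bhw', frame_feat, audio_emb)

def info_nce(vis_emb, aud_emb, tau=0.07):
    """Symmetric InfoNCE: matched audio-visual pairs in the batch are
    positives, all other pairings are negatives. Training on augmented
    views of the same pair encourages transformation invariance."""
    v = F.normalize(vis_emb, dim=1)
    a = F.normalize(aud_emb, dim=1)
    logits = v @ a.t() / tau
    labels = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Toy tensors standing in for encoder outputs.
B, C, H, W = 4, 128, 14, 14
frame_feat = torch.randn(B, C, H, W)  # visual feature map of a frame
audio_emb = torch.randn(B, C)         # embedding of the paired audio

# Transformation equivariance: the map predicted for a flipped frame
# should equal the flipped map of the original frame. The horizontal
# flip stands in for any geometric augmentation; with real encoders one
# would flip the raw frame and re-encode it (flipping the features
# directly, as done here for brevity, makes the loss trivially zero).
map_of_flipped = localisation_map(torch.flip(frame_feat, dims=[3]), audio_emb)
flipped_map = torch.flip(localisation_map(frame_feat, audio_emb), dims=[2])
equivariance_loss = F.mse_loss(map_of_flipped, flipped_map)

# Transformation invariance: photometric augmentations (colour jitter,
# blur, ...) change the pixels but should not change the pooled
# audio-visual match, which the contrastive loss enforces.
invariance_loss = info_nce(frame_feat.mean(dim=(2, 3)), audio_emb)

print(equivariance_loss.item(), invariance_loss.item())
```

In a real pipeline both losses would be computed on re-encoded augmented frames and summed (possibly with a weighting factor) before backpropagation; the sketch only fixes the shape of the two consistency terms.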