Paper Title
Spatial-then-Temporal Self-Supervised Learning for Video Correspondence
Paper Authors
Paper Abstract
In low-level video analysis, effective representations are important for deriving the correspondences between video frames. In some recent studies, such representations have been learned in a self-supervised fashion from unlabeled images or videos using carefully designed pretext tasks. However, previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a spatial-then-temporal self-supervised learning method. Specifically, we first extract spatial features from unlabeled images via contrastive learning, and then enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure that the learning does not forget the spatial cues, and a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. The proposed method outperforms state-of-the-art self-supervised methods, as demonstrated by experimental results on a series of correspondence-based video analysis tasks. We also perform ablation studies to verify the effectiveness of the two-step design as well as of the distillation losses.
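To make the two-step recipe concrete, below is a minimal PyTorch sketch of the three loss components named in the abstract: a contrastive loss for step 1, and a frame-reconstruction loss plus a correlation distillation loss for step 2. All function names, tensor shapes, the InfoNCE formulation, and the affinity-based reconstruction are illustrative assumptions, not the authors' exact implementation; the local variant of the distillation loss is only indicated in a comment.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    # Step 1: contrastive loss over image features (spatial cues).
    # q, k: (N, D) L2-normalized embeddings of two augmented views of the
    # same N images; matching rows are positives, all others negatives.
    logits = q @ k.t() / temperature                    # (N, N) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def reconstruction_loss(feat_t, feat_ref, frame_t, frame_ref):
    # Step 2: reconstruct frame t from a reference frame, using the
    # softmax-normalized feature affinity as a copy-paste operator.
    # feat_*: (N, C, H, W) features; frame_*: (N, 3, H, W) frames resized
    # to the feature resolution (an assumed simplification).
    N, C, H, W = feat_t.shape
    ft = feat_t.flatten(2).transpose(1, 2)              # (N, HW, C)
    fr = feat_ref.flatten(2)                            # (N, C, HW)
    affinity = torch.softmax(ft @ fr / C ** 0.5, dim=-1)  # (N, HW, HW)
    pix = frame_ref.flatten(2).transpose(1, 2)          # (N, HW, 3)
    recon = (affinity @ pix).transpose(1, 2).reshape(N, 3, H, W)
    return F.mse_loss(recon, frame_t)

def global_correlation_distillation(feat_student, feat_teacher):
    # Keep the spatial cues: match the student's all-pairs feature
    # correlation map to that of the frozen step-1 (teacher) encoder.
    # A *local* variant would compare correlations only within a small
    # window around each position, to cope with temporal discontinuity.
    def corr(f):                                        # f: (N, C, H, W)
        f = F.normalize(f.flatten(2), dim=1)            # (N, C, HW)
        return f.transpose(1, 2) @ f                    # (N, HW, HW) cosine
    return F.mse_loss(corr(feat_student), corr(feat_teacher.detach()))
```

In step 2, the total objective would plausibly combine these terms, e.g. loss = reconstruction_loss + λ_g · global distillation + λ_l · local distillation, with the step-1 encoder kept frozen as the teacher; the weights λ_g and λ_l are hypothetical hyperparameters.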