Paper Title
Unsupervised Pre-training for Temporal Action Localization Tasks
Paper Authors
Paper Abstract
Unsupervised video representation learning has made remarkable achievements in recent years. However, most existing methods are designed and optimized for video classification. These pre-trained models can be sub-optimal for temporal localization tasks due to the inherent discrepancy between video-level classification and clip-level localization. To bridge this gap, we make the first attempt to propose a self-supervised pretext task, coined as Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL). Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions, and then paste them onto different temporal positions of two other videos. The pretext task is to align the features of the pasted pseudo action regions from the two synthetic videos and maximize the agreement between them. Compared to existing unsupervised video representation learning approaches, our PAL adapts better to downstream TAL tasks by introducing a temporal equivariant contrastive learning paradigm in a temporally dense and scale-aware manner. Extensive experiments show that PAL can utilize large-scale unlabeled video data to significantly boost the performance of existing TAL methods. Our code and models will be made publicly available at https://github.com/zhang-can/UP-TAL.
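The copy-and-paste construction described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: videos are represented as plain lists of clip features, and all function and parameter names (`pseudo_action_paste`, `min_len`, `max_len`) are hypothetical. It cuts a multi-clip pseudo action from one video and pastes it at random temporal positions in two background videos, returning the pasted regions whose encoded features the pretext task would align.

```python
import random

def pseudo_action_paste(action_video, bg_video_a, bg_video_b,
                        min_len=2, max_len=4, seed=None):
    """Illustrative sketch of the PAL augmentation (names are assumptions,
    not from the paper's released code). Each video is a list of clip
    features; the pasted region overwrites the background clips it covers."""
    rng = random.Random(seed)

    # 1. Randomly select a temporal region of multiple clips as the pseudo action.
    length = rng.randint(min_len, min(max_len, len(action_video)))
    start = rng.randint(0, len(action_video) - length)
    pseudo_action = action_video[start:start + length]

    def paste(background):
        # 2. Paste the pseudo action at a random temporal position.
        pos = rng.randint(0, len(background) - length)
        synthetic = background[:pos] + pseudo_action + background[pos + length:]
        return synthetic, (pos, pos + length)  # region whose features get aligned

    view_a, region_a = paste(bg_video_a)
    view_b, region_b = paste(bg_video_b)
    # 3. The pretext task maximizes agreement between the encoded features of
    #    region_a in view_a and region_b in view_b (contrastive alignment).
    return view_a, region_a, view_b, region_b
```

Because the same pseudo action lands at different positions and surrounded by different backgrounds in the two synthetic videos, aligning the two regions encourages representations that are sensitive to temporal location, which is the property downstream TAL methods need.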