Paper Title
Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders
Paper Authors
Paper Abstract
Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised spatiotemporal representation learners. Inherited from their image counterparts, however, existing video MAEs still focus largely on static appearance learning while remaining limited in capturing dynamic temporal information, and are hence less effective for video downstream tasks. To resolve this drawback, in this work we present a motion-aware variant -- MotionMAE. Apart from learning to reconstruct individual masked patches of video frames, our model is designed to additionally predict the corresponding motion structure information over time. This motion information is available from the temporal difference of nearby frames. As a result, our model can spontaneously and effectively extract both static appearance and dynamic motion, leading to superior spatiotemporal representation learning capability. Extensive experiments show that our MotionMAE significantly outperforms both the supervised learning baseline and state-of-the-art MAE alternatives, under both domain-specific and domain-generic pretraining-then-finetuning settings. In particular, when using ViT-B as the backbone, our MotionMAE surpasses the prior state-of-the-art model by a margin of 1.2% on Something-Something V2 and 3.2% on UCF101 in the domain-specific pretraining setting. Encouragingly, it also surpasses the competing MAEs by a large margin of over 3% on the challenging video object segmentation task. The code is available at https://github.com/happy-hsy/MotionMAE.
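The abstract states that the motion target is obtained from the temporal difference of nearby frames, alongside the usual pixel-reconstruction target for masked patches. A minimal sketch of building these two targets is shown below; the function name, array shapes, and use of adjacent-frame differencing are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def motion_mae_targets(clip: np.ndarray):
    """Build the two reconstruction targets described in the abstract.

    clip: float array of shape (T, H, W, C) holding T video frames.
    Returns:
      appearance_target: the raw frames (static appearance), to be
        reconstructed at masked patch locations.
      motion_target: the temporal difference of adjacent frames
        (dynamic motion structure), with shape (T-1, H, W, C).
    """
    appearance_target = clip.copy()
    # Motion structure as the difference between each frame and its predecessor.
    motion_target = clip[1:] - clip[:-1]
    return appearance_target, motion_target
```

In practice both targets would be patchified and supervised only at masked positions; the sketch above only shows how the motion signal itself is derived.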