Paper Title
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders
Paper Authors
Paper Abstract
Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs.
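To make the adaptive masking idea in the abstract concrete, below is a minimal PyTorch-style sketch of an auxiliary sampling network that scores spacetime-patch tokens, samples the visible set from a categorical distribution, and is trained with a REINFORCE-style loss. The names (TokenSampler, sample_visible_tokens, adaptive_sampling_loss) and the exact form of the reward are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of adaptive token sampling for an MAE, assuming a per-clip
# reconstruction error is available as the reward signal. Illustrative only.
import torch
import torch.nn as nn


class TokenSampler(nn.Module):
    """Auxiliary network that scores spacetime-patch tokens."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch embeddings -> (B, N) logits over tokens
        return self.score(tokens).squeeze(-1)


def sample_visible_tokens(logits: torch.Tensor, mask_ratio: float = 0.95):
    """Sample the (1 - mask_ratio) fraction of tokens to keep visible."""
    B, N = logits.shape
    num_visible = max(1, int(round(N * (1.0 - mask_ratio))))
    probs = logits.softmax(dim=-1)
    # Draw visible-token indices without replacement from the categorical
    # distribution; the remaining tokens are masked.
    visible_idx = torch.multinomial(probs, num_visible, replacement=False)
    return visible_idx, probs


def adaptive_sampling_loss(probs: torch.Tensor,
                           visible_idx: torch.Tensor,
                           recon_error: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style objective: log-probabilities of the sampled visible
    tokens are weighted by the detached reconstruction error, so selections
    associated with higher expected error are rewarded."""
    log_p = torch.log(probs.gather(1, visible_idx) + 1e-8)  # (B, K)
    reward = recon_error.detach()                           # scalar or (B, 1)
    return -(log_p * reward).mean()
```

In this reading, the sampler is trained jointly with the MAE but only through the auxiliary loss (the reward is detached), while the encoder-decoder is trained with the usual masked-token reconstruction loss.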