Paper Title

Exploring Long-Sequence Masked Autoencoders

Authors

Ronghang Hu, Shoubhik Debnath, Saining Xie, Xinlei Chen

Abstract

Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains. In contrast to discrete tokens in natural languages, the input for image MAE is continuous and subject to additional specifications. We systematically study each input specification during the pre-training stage, and find sequence length is a key axis that further scales MAE. Our study leads to a long-sequence version of MAE with minimal changes to the original recipe, by just decoupling the mask size from the patch size. For object detection and semantic segmentation, our long-sequence MAE shows consistent gains across all the experimental setups without extra computation cost during the transfer. While long-sequence pre-training is discerned most beneficial for detection and segmentation, we also achieve strong results on ImageNet-1K classification by keeping a standard image size and only increasing the sequence length. We hope our findings can provide new insights and avenues for scaling in computer vision.
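The key change the abstract describes is decoupling the mask size from the patch size: with long sequences (many small patches), masking is still sampled on a coarser grid of "mask units", each covering a block of patches. The sketch below illustrates that idea in NumPy; the function name, shapes, and parameters are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def decoupled_random_mask(grid_h, grid_w, mask_unit, mask_ratio, seed=None):
    """Sample a patch-level boolean mask whose masked regions are
    mask_unit x mask_unit blocks of patches, so the mask size is
    decoupled from the patch size. (Illustrative sketch only.)"""
    assert grid_h % mask_unit == 0 and grid_w % mask_unit == 0
    rng = np.random.default_rng(seed)
    # Sample masking on the coarse mask-unit grid.
    ch, cw = grid_h // mask_unit, grid_w // mask_unit
    n_coarse = ch * cw
    n_masked = int(round(n_coarse * mask_ratio))
    coarse = np.zeros(n_coarse, dtype=bool)
    coarse[rng.choice(n_coarse, size=n_masked, replace=False)] = True
    coarse = coarse.reshape(ch, cw)
    # Upsample to the fine patch grid: each mask unit covers a block of patches.
    mask = coarse.repeat(mask_unit, axis=0).repeat(mask_unit, axis=1)
    return mask  # shape (grid_h, grid_w); True = masked patch

# Example: a 448px image with 8px patches gives a 56x56 patch grid;
# a 32px mask unit spans 4x4 patches, with the usual 75% mask ratio.
mask = decoupled_random_mask(56, 56, mask_unit=4, mask_ratio=0.75, seed=0)
```

With `mask_unit=1` this reduces to standard per-patch MAE masking, so the recipe change is minimal, matching the abstract's claim.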
