Paper Title


Corrupted Image Modeling for Self-Supervised Visual Pre-Training

Authors

Yuxin Fang, Li Dong, Hangbo Bao, Xinggang Wang, Furu Wei

Abstract


We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial [MASK] tokens, where some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns to either recover all the original image pixels, or predict whether each visual token is replaced by a generator sample or not. The generator and the enhancer are simultaneously trained and synergistically updated. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework that is suitable for various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework. Experimental results show that our approach achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.
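To make the pre-training recipe concrete, below is a minimal sketch of one training step for the replaced-token-detection variant described in the abstract. All module names, shapes, and constants here (tokenizer, generator, embed, enhancer, PATCHES, VOCAB, DIM) are hypothetical toy stand-ins, not the authors' implementation: the frozen dVAE tokenizer/decoder and the BEiT/ViT backbones are replaced with single linear layers, and the masking of the generator's input is elided for brevity.

```python
# Minimal sketch of one CIM step (replaced-token-detection variant).
# All modules are toy stand-ins; this is an illustration, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCHES, VOCAB, DIM = 196, 8192, 64   # 14x14 patches; toy vocab/embedding sizes

tokenizer = nn.Linear(DIM, VOCAB)     # stand-in for the frozen visual tokenizer
generator = nn.Linear(DIM, VOCAB)     # stand-in for the small trainable BEiT
embed     = nn.Embedding(VOCAB, DIM)  # stand-in for decode-to-pixels + re-embed
enhancer  = nn.Linear(DIM, 2)         # stand-in encoder + binary detection head

patches = torch.randn(2, PATCHES, DIM)  # a batch of patch embeddings

# 1) Tokenize the clean image and randomly select patch positions to corrupt.
with torch.no_grad():
    tokens = tokenizer(patches).argmax(-1)        # (B, P) visual tokens
mask = torch.rand(2, PATCHES) < 0.4               # positions chosen for corruption

# 2) The generator predicts a token distribution; plausible replacements are
#    *sampled* from it (not argmax), as described in the abstract.
logits = generator(patches)                       # (B, P, VOCAB)
samples = torch.distributions.Categorical(logits=logits).sample()
corrupted_tokens = torch.where(mask, samples, tokens)

# A sampled token can coincide with the original; those positions count as
# "not replaced" when building the detection labels.
replaced = (corrupted_tokens != tokens).long()    # (B, P) 0/1 labels

# 3) The enhancer sees the corrupted image and predicts, per position,
#    whether the visual token was replaced by a generator sample.
pred = enhancer(embed(corrupted_tokens))          # (B, P, 2)
loss_enhancer = F.cross_entropy(pred.reshape(-1, 2), replaced.reshape(-1))

# 4) The generator is trained with masked visual-token prediction on the same
#    batch, so the two networks are updated simultaneously and synergistically.
loss_generator = F.cross_entropy(logits[mask], tokens[mask])
(loss_generator + loss_enhancer).backward()
```

The pixel-recovery variant mentioned in the abstract would swap the binary head for a per-patch regression head trained to reconstruct the original pixels from the same corrupted input; after pre-training, only the enhancer is kept as the downstream visual encoder.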
