Paper Title

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Paper Authors

Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang

Paper Abstract

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, based on our two novel observations: (i) a MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% $\sim$ 50% of the input embeddings; (ii) to construct multi-scale representations for object detection from a single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained large-kernel patchify stem, and its intermediate features can naturally serve as the higher-resolution inputs of a feature pyramid network without further upsampling or other manipulation. Meanwhile, the pre-trained ViT is regarded only as the 3$^{rd}$ stage of our detector's backbone rather than the whole feature extractor, resulting in a ConvNet-ViT hybrid feature extractor. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results than the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8$\times$ faster. Code and pre-trained models are available at https://github.com/hustvl/MIMDet.
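To make the two observations concrete, below is a minimal PyTorch sketch of the described pipeline: a compact convolutional stem whose stride-4 and stride-8 intermediate features feed an FPN directly, a stride-16 output that becomes the token input to the pre-trained ViT (the backbone's 3rd stage), and random partial sampling of those tokens. All module names, channel widths, and the `sample_partial_tokens` helper are illustrative assumptions, not MIMDet's actual implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn as nn


class CompactConvStem(nn.Module):
    """Sketch of a randomly initialized compact conv stem that supplants the
    pre-trained large-kernel patchify stem. Intermediate feature maps at
    strides 4 and 8 can serve as FPN inputs without further upsampling."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.stage1 = nn.Sequential(  # stride 4
            nn.Conv2d(3, embed_dim // 4, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim // 4, 3, stride=2, padding=1), nn.GELU(),
        )
        self.stage2 = nn.Sequential(  # stride 8
            nn.Conv2d(embed_dim // 4, embed_dim // 2, 3, stride=2, padding=1), nn.GELU(),
        )
        self.stage3 = nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1)  # stride 16

    def forward(self, x):
        c2 = self.stage1(x)   # 1/4 resolution  -> higher-resolution FPN input
        c3 = self.stage2(c2)  # 1/8 resolution  -> higher-resolution FPN input
        c4 = self.stage3(c3)  # 1/16 resolution -> tokens for the ViT (3rd stage)
        return c2, c3, c4


def sample_partial_tokens(tokens: torch.Tensor, sample_ratio: float = 0.5):
    """Observation (i): randomly keep only a fraction (e.g., 25%~50%) of the
    input embeddings before feeding them to the MIM pre-trained ViT encoder."""
    B, N, C = tokens.shape
    num_keep = int(N * sample_ratio)
    # Random permutation per sample; keep the first `num_keep` token indices.
    keep_ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    kept = tokens.gather(1, keep_ids.unsqueeze(-1).expand(-1, -1, C))
    return kept, keep_ids


# Usage sketch: wire the stem outputs into tokens for the ViT stage.
stem = CompactConvStem(embed_dim=768)
img = torch.randn(2, 3, 224, 224)
c2, c3, c4 = stem(img)                           # (2,192,56,56), (2,384,28,28), (2,768,14,14)
tokens = c4.flatten(2).transpose(1, 2)           # (2, 196, 768) sequence for the ViT
kept, keep_ids = sample_partial_tokens(tokens)   # e.g., 50% of the embeddings
```

The kept tokens would then pass through the pre-trained ViT blocks, and its output, together with `c2` and `c3` from the stem, would form the multi-scale inputs of the FPN, yielding the ConvNet-ViT hybrid feature extractor the abstract describes.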
