Paper Title
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning
Paper Authors
Paper Abstract
Contrastive learning methods for unsupervised visual representation learning have reached remarkable levels of transfer performance. We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions. In this paper, we introduce pixel-level pretext tasks for learning dense feature representations. The first task directly applies contrastive learning at the pixel level. We additionally propose a pixel-to-propagation consistency task that produces better results, even surpassing the state-of-the-art approaches by a large margin. Specifically, it achieves 60.2 AP, 41.4 / 40.5 mAP and 77.2 mIoU when transferred to Pascal VOC object detection (C4), COCO object detection (FPN / C4) and Cityscapes semantic segmentation using a ResNet-50 backbone network, which are 2.6 AP, 0.8 / 1.0 mAP and 1.0 mIoU better than the previous best methods built on instance-level contrastive learning. Moreover, the pixel-level pretext tasks are found to be effective for pre-training not only regular backbone networks but also head networks used for dense downstream tasks, and are complementary to instance-level contrastive methods. These results demonstrate the strong potential of defining pretext tasks at the pixel level, and suggest a new path forward in unsupervised visual representation learning. Code is available at \url{https://github.com/zdaxie/PixPro}.
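To make the first pretext task concrete, below is a minimal sketch of a pixel-level contrastive (InfoNCE) loss in PyTorch. This is not the released PixPro implementation; the tensor shapes, the positive-pair radius `pos_radius`, and the temperature are illustrative assumptions. The idea follows the abstract: pixels from two augmented views that map to nearby locations in the original image form positive pairs, and all other cross-view pixels act as negatives.

```python
# Minimal sketch (assumptions, not the authors' released code) of a
# pixel-level contrastive (InfoNCE) loss over two augmented views.
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feat1, feat2, coord1, coord2,
                           pos_radius=0.7, temperature=0.3):
    """feat1, feat2: (B, C, H, W) dense features of two augmented views.
    coord1, coord2: (B, H, W, 2) original-image coordinates of each pixel."""
    B, C, H, W = feat1.shape
    # L2-normalize so dot products are cosine similarities.
    f1 = F.normalize(feat1.flatten(2), dim=1)                 # (B, C, H*W)
    f2 = F.normalize(feat2.flatten(2), dim=1)                 # (B, C, H*W)
    logits = torch.bmm(f1.transpose(1, 2), f2) / temperature  # (B, HW, HW)

    # Positive pairs: cross-view pixels whose original-image coordinates
    # fall within the (assumed) radius of each other.
    c1 = coord1.flatten(1, 2)                                 # (B, HW, 2)
    c2 = coord2.flatten(1, 2)                                 # (B, HW, 2)
    pos_mask = (torch.cdist(c1, c2) < pos_radius).float()     # (B, HW, HW)

    # InfoNCE over all cross-view pairs; pixels with no positive match
    # (regions the two crops do not share) are skipped.
    log_prob = logits - torch.logsumexp(logits, dim=2, keepdim=True)
    pos_count = pos_mask.sum(dim=2)                           # (B, HW)
    valid = pos_count > 0
    loss = -(pos_mask * log_prob).sum(dim=2)[valid] / pos_count[valid]
    return loss.mean()
```

Per the abstract, the stronger pixel-to-propagation consistency variant differs from this plain pixel-level contrast in that one branch's pixel feature is replaced by a feature propagated (similarity-weighted) from neighboring pixels before the two views are matched; the details are in the paper and the linked repository.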