Paper Title
Patch-level Representation Learning for Self-supervised Vision Transformers
Paper Authors
Paper Abstract
Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to further improve their performance by utilizing the architectural advantages of the underlying neural network, since the current state-of-the-art visual pretext tasks for SSL do not enjoy such benefits, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have recently gained much attention as a better architectural choice, often outperforming convolutional networks on various visual tasks. A unique characteristic of ViTs is that they take a sequence of disjoint patches from an image and process patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance between each patch and its neighbors, i.e., each patch treats its similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns semantically more meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, for downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that SelfPatch can significantly improve the performance of existing SSL methods on various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
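To make the pretext task concrete, below is a minimal PyTorch sketch of the positive-sample selection described in the abstract: each patch picks its most similar adjacent patches (from a stop-gradient teacher branch) as positives and is pulled toward them. All names here (`neighbor_mask`, `selfpatch_loss`, `top_k`) are illustrative assumptions rather than the authors' code, and a plain cosine-alignment loss stands in for the paper's DINO-style objective.

```python
# Illustrative sketch of SelfPatch-style positive selection; not the
# authors' implementation. A cosine-alignment loss substitutes for the
# paper's DINO-style objective for brevity.
import torch
import torch.nn.functional as F


def neighbor_mask(h: int, w: int) -> torch.Tensor:
    """Boolean (h*w, h*w) mask, True where patch j is an 8-connected
    spatial neighbor of patch i on an h x w patch grid."""
    idx = torch.arange(h * w)
    yi, xi = idx // w, idx % w
    dy = (yi[:, None] - yi[None, :]).abs()
    dx = (xi[:, None] - xi[None, :]).abs()
    return (dy <= 1) & (dx <= 1) & ~torch.eye(h * w, dtype=torch.bool)


def selfpatch_loss(student: torch.Tensor, teacher: torch.Tensor,
                   h: int, w: int, top_k: int = 3) -> torch.Tensor:
    """student, teacher: (B, h*w, D) patch embeddings from the two branches.
    Each patch selects its top_k most similar neighboring teacher patches
    as positives and maximizes cosine similarity to them. top_k must not
    exceed the minimum neighbor count (3, at grid corners)."""
    mask = neighbor_mask(h, w).to(student.device)     # (N, N)
    t = F.normalize(teacher.detach(), dim=-1)         # stop-gradient branch
    s = F.normalize(student, dim=-1)
    sim = torch.einsum('bnd,bmd->bnm', t, t)          # teacher-side patch similarity
    sim = sim.masked_fill(~mask, float('-inf'))       # restrict to adjacent patches
    pos = sim.topk(top_k, dim=-1).indices             # (B, N, top_k) positive ids
    batch = torch.arange(s.size(0), device=s.device)[:, None, None]
    pos_t = t[batch, pos]                             # (B, N, top_k, D)
    # Pull each student patch toward its positive teacher neighbors.
    return -(s.unsqueeze(2) * pos_t).sum(-1).mean()
```

As a usage example, a ViT-S/16 on a 224x224 input yields a 14x14 patch grid of 384-dimensional embeddings, so `selfpatch_loss(student, teacher, h=14, w=14)` would operate on tensors of shape (B, 196, 384).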