Paper Title

Vision Transformers provably learn spatial structure

Paper Authors

Samy Jelassi, Michael E. Sander, Yuanzhi Li

Paper Abstract

Vision Transformers (ViTs) have achieved comparable or superior performance to Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any visual inductive bias of spatial locality. Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized patterns. This raises a central question: how do ViTs learn these patterns by solely minimizing their training loss using gradient-based methods from random initialization? In this paper, we provide some theoretical justification of this phenomenon. We propose a spatially structured dataset and a simplified ViT model. In this model, the attention matrix solely depends on the positional encodings. We call this mechanism the positional attention mechanism. On the theoretical side, we consider a binary classification task and show that while the learning problem admits multiple solutions that generalize, our model implicitly learns the spatial structure of the dataset while generalizing: we call this phenomenon patch association. We prove that patch association helps to sample-efficiently transfer to downstream datasets that share the same structure as the pre-training one but differ in the features. Lastly, we empirically verify that a ViT with positional attention performs similarly to the original one on CIFAR-10/100, SVHN and ImageNet.
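
For intuition, the positional attention mechanism described in the abstract can be sketched as an attention layer whose attention matrix is computed from the positional encodings alone, so it is shared across all inputs while the values still come from the patch features. The sketch below is a minimal, assumption-laden rendering (single head, NumPy, illustrative shapes and weight names such as `W_Q`, `W_K`, `W_V`), not the paper's exact parameterization.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def positional_attention(X, P, W_Q, W_K, W_V):
    """X: (n_patches, d) patch features; P: (n_patches, d_pos) positional encodings.

    The attention matrix depends only on P, so it is identical for every image;
    the input patches X enter only through the value projection.
    """
    scores = (P @ W_Q) @ (P @ W_K).T / np.sqrt(W_Q.shape[1])  # content-independent (n, n) scores
    A = softmax(scores, axis=-1)
    return A @ (X @ W_V)

# Toy usage with hypothetical dimensions.
rng = np.random.default_rng(0)
n, d, d_pos, d_k = 16, 32, 8, 8
X, P = rng.normal(size=(n, d)), rng.normal(size=(n, d_pos))
out = positional_attention(X, P,
                           rng.normal(size=(d_pos, d_k)),
                           rng.normal(size=(d_pos, d_k)),
                           rng.normal(size=(d, d)))
print(out.shape)  # (16, 32)
```

Because the attention scores never see X, whatever structure gradient-based training puts into this attention matrix is a function of the positions themselves, which is the sense in which such a model can encode the dataset's spatial structure.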
