Paper Title
A Close Look at Spatial Modeling: From Attention to Convolution
Paper Authors
Paper Abstract
Vision Transformers have recently shown great promise for many vision tasks due to their insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (and also regardless of the head). Second, the attention maps are intrinsically sparse: only a few tokens dominate the attention weights, and introducing knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate this global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), consists purely of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including the dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks such as object detection, instance segmentation, and semantic segmentation. Code and models are available at: https://github.com/ma-xu/FCViT.
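To make the abstract's core idea more concrete, below is a minimal PyTorch sketch of a block that replaces per-query attention with a single query-irrelevant global context and then refines features with convolutions. This is an illustrative assumption of the described scheme, not the authors' implementation: the module name GlobalContextConv, the 1x1 context projection, and the depthwise/pointwise convolution choices are all hypothetical; refer to the linked repository for the actual FCViT code.

```python
# Minimal sketch (under stated assumptions) of a query-irrelevant global context
# injected into a convolutional block, in the spirit of the FCViT abstract.
import torch
import torch.nn as nn


class GlobalContextConv(nn.Module):
    """Hypothetical block: one shared (query-irrelevant) attention map pools a
    global context, which is added to every position before local convolutions."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Projects features to a single spatial weight map shared by all queries.
        self.context_proj = nn.Conv2d(dim, 1, kernel_size=1)
        # Depthwise convolution for short-range modeling with weight sharing.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        # Pointwise convolution to mix channels.
        self.pwconv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        # One softmax-normalized weight map, independent of the query position.
        weights = self.context_proj(x).flatten(2).softmax(dim=-1)     # (B, 1, H*W)
        context = (x.flatten(2) * weights).sum(dim=-1, keepdim=True)  # (B, C, 1)
        context = context.unsqueeze(-1)                               # (B, C, 1, 1)
        # Inject the long-range (global) context, then refine locally.
        x = x + context
        return x + self.pwconv(self.dwconv(x))


# Usage on a toy feature map.
if __name__ == "__main__":
    block = GlobalContextConv(dim=64)
    feats = torch.randn(2, 64, 14, 14)
    print(block(feats).shape)  # torch.Size([2, 64, 14, 14])
```

Because the context weights are computed once per image rather than once per query, the block keeps the dynamic, input-dependent behavior of attention while avoiding the quadratic cost of pairwise token interactions; the convolutions then supply the short-range modeling and weight sharing mentioned in the abstract.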