Paper Title
A Close Look at Spatial Modeling: From Attention to Convolution
Paper Authors
Paper Abstract
Vision Transformers have recently shown great promise for many vision tasks due to their insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (and also regardless of the head). Second, the attention maps are intrinsically sparse: only a few tokens dominate the attention weights, and introducing knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by the above observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate this global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), consists purely of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including the dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks such as object detection, instance segmentation, and semantic segmentation. Code and models are available at: https://github.com/ma-xu/FCViT.
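To make the abstract's core idea more concrete, below is a minimal PyTorch sketch of a block that replaces per-query attention with a single query-irrelevant global context and then refines features with convolutions. This is an illustrative assumption of the described scheme, not the authors' implementation: the module name GlobalContextConv, the 1x1 context projection, and the depthwise/pointwise convolution choices are all hypothetical; refer to the linked repository for the actual FCViT code.

```python
# Minimal sketch (under stated assumptions) of a query-irrelevant global context
# injected into a convolutional block, in the spirit of the FCViT abstract.
import torch
import torch.nn as nn


class GlobalContextConv(nn.Module):
    """Hypothetical block: one shared (query-irrelevant) attention map pools a
    global context, which is added to every position before local convolutions."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Projects features to a single spatial weight map shared by all queries.
        self.context_proj = nn.Conv2d(dim, 1, kernel_size=1)
        # Depthwise convolution for short-range modeling with weight sharing.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        # Pointwise convolution to mix channels.
        self.pwconv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        # One softmax-normalized weight map, independent of the query position.
        weights = self.context_proj(x).flatten(2).softmax(dim=-1)     # (B, 1, H*W)
        context = (x.flatten(2) * weights).sum(dim=-1, keepdim=True)  # (B, C, 1)
        context = context.unsqueeze(-1)                               # (B, C, 1, 1)
        # Inject the long-range (global) context, then refine locally.
        x = x + context
        return x + self.pwconv(self.dwconv(x))


# Usage on a toy feature map.
if __name__ == "__main__":
    block = GlobalContextConv(dim=64)
    feats = torch.randn(2, 64, 14, 14)
    print(block(feats).shape)  # torch.Size([2, 64, 14, 14])
```

Because the context weights are computed once per image rather than once per query, the block keeps the dynamic, input-dependent behavior of attention while avoiding the quadratic cost of pairwise token interactions; the convolutions then supply the short-range modeling and weight sharing mentioned in the abstract.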