Paper Title
Vision Transformer with Super Token Sampling
Paper Authors
Paper Abstract
Vision transformer has achieved impressive performance for many vision tasks. However, it may suffer from high redundancy in capturing local features for shallow layers. Local self-attention or early-stage convolutions are thus utilized, which sacrifices the capacity to capture long-range dependency. A challenge then arises: can we access efficient and effective global context modeling at the early stages of a neural network? To address this issue, we draw inspiration from the design of superpixels, which reduces the number of image primitives in subsequent processing, and introduce super tokens into vision transformer. Super tokens attempt to provide a semantically meaningful tessellation of visual content, thus reducing the token number in self-attention as well as preserving global modeling. Specifically, we propose a simple yet strong super token attention (STA) mechanism with three steps: the first samples super tokens from visual tokens via sparse association learning, the second performs self-attention on super tokens, and the last maps them back to the original token space. STA decomposes vanilla global attention into multiplications of a sparse association map and a low-dimensional attention, leading to high efficiency in capturing global dependencies. Based on STA, we develop a hierarchical vision transformer. Extensive experiments demonstrate its strong performance on various vision tasks. In particular, without any extra training data or labels, it achieves 86.4% top-1 accuracy on ImageNet-1K with less than 100M parameters. It also achieves 53.9 box AP and 46.8 mask AP on the COCO detection task, and 51.9 mIoU on the ADE20K semantic segmentation task. Code is released at https://github.com/hhb072/STViT.
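As a rough illustration of the three-step STA mechanism described in the abstract, below is a minimal sketch in Python/PyTorch. It is not the authors' implementation (see the STViT repository for that): the function name `super_token_attention`, the average-pooling initialization of super tokens, the dense (rather than sparse) association softmax, and the single-head, projection-free attention are simplifying assumptions made here for readability.

```python
import torch
import torch.nn.functional as F


def super_token_attention(x, h, w, grid_h, grid_w, n_iter=1):
    """Sketch of super token attention (STA): sample super tokens,
    attend among them, then map them back to the token space.

    x: (B, N, C) visual tokens with N = h * w.
    Returns tokens of the same shape (B, N, C).
    """
    B, N, C = x.shape
    assert N == h * w

    # Initialization (assumption): average-pool the token grid down to a
    # coarse (grid_h, grid_w) grid of M = grid_h * grid_w super tokens.
    feat = x.transpose(1, 2).reshape(B, C, h, w)
    s = F.adaptive_avg_pool2d(feat, (grid_h, grid_w))      # (B, C, gh, gw)
    s = s.flatten(2).transpose(1, 2)                        # (B, M, C)

    # Step 1: learn token-to-super-token associations. The paper uses a
    # sparse association (each token is tied to a few nearby super tokens);
    # this sketch uses a dense softmax association for brevity.
    for _ in range(n_iter):
        attn = torch.softmax(x @ s.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, N, M)
        # Aggregate tokens into super tokens as an association-weighted average.
        s = (attn.transpose(1, 2) @ x) / (attn.sum(dim=1).unsqueeze(-1) + 1e-6)

    # Step 2: self-attention among the M super tokens only (single-head,
    # no learned projections here), which is cheap because M << N.
    sa = torch.softmax(s @ s.transpose(1, 2) / C ** 0.5, dim=-1)        # (B, M, M)
    s = sa @ s                                                           # (B, M, C)

    # Step 3: map the updated super tokens back to the original token
    # space through the same association map.
    return attn @ s                                                      # (B, N, C)


if __name__ == "__main__":
    tokens = torch.randn(2, 56 * 56, 96)   # e.g. an early-stage feature map
    out = super_token_attention(tokens, 56, 56, grid_h=8, grid_w=8)
    print(out.shape)                        # torch.Size([2, 3136, 96])
```

With M super tokens and M << N, the attention in step 2 scales with M² rather than N², which is where the efficiency in capturing global dependencies claimed in the abstract comes from.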