Paper Title

Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice

Authors

Peihao Wang, Wenqing Zheng, Tianlong Chen, Zhangyang Wang

Abstract

Vision Transformer (ViT) has recently demonstrated promise in computer vision problems. However, unlike Convolutional Neural Networks (CNN), the performance of ViT is known to saturate quickly as depth increases, due to the observed attention collapse or patch uniformity. Despite a couple of empirical solutions, a rigorous framework for studying this scalability issue remains elusive. In this paper, we first establish a rigorous theoretical framework to analyze ViT features from the Fourier spectrum domain. We show that the self-attention mechanism inherently amounts to a low-pass filter, which indicates that when ViT scales up its depth, excessive low-pass filtering will cause feature maps to only preserve their Direct-Current (DC) component. We then propose two straightforward yet effective techniques to mitigate this undesirable low-pass limitation. The first technique, termed AttnScale, decomposes a self-attention block into low-pass and high-pass components, then rescales and combines these two filters to produce an all-pass self-attention matrix. The second technique, termed FeatScale, re-weights feature maps on separate frequency bands to amplify the high-frequency signals. Both techniques are efficient and hyperparameter-free, while effectively overcoming relevant ViT training artifacts such as attention collapse and patch uniformity. By seamlessly plugging our techniques into multiple ViT variants, we demonstrate that they consistently help ViTs benefit from deeper architectures, bringing up to 1.1% performance gains "for free" (i.e., with little parameter overhead). We publicly release our code and pre-trained models at https://github.com/VITA-Group/ViT-Anti-Oversmoothing.
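
As a reading aid, below is a minimal PyTorch sketch of how the two techniques described in the abstract could be realized. It is based only on the abstract: the module names, the learnable coefficients (`omega`, `lambda_dc`, `lambda_hc`), and where they would sit inside a Transformer block are illustrative assumptions, not the authors' reference implementation, which is available in the linked repository.

```python
# A minimal sketch of AttnScale and FeatScale as described in the abstract.
# All parameterizations below are assumptions for illustration only.
import torch
import torch.nn as nn


class AttnScale(nn.Module):
    """Split an attention matrix into a low-pass (DC) part and a high-pass
    residual, then re-scale the high-pass part per head before recombining."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable scale per head (assumed parameterization).
        self.omega = nn.Parameter(torch.zeros(num_heads))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, heads, tokens, tokens), rows sum to 1 after softmax.
        n = attn.size(-1)
        # Low-pass component: the uniform averaging matrix (1/n) * 1 1^T.
        low_pass = torch.full_like(attn, 1.0 / n)
        high_pass = attn - low_pass
        scale = (1.0 + self.omega).view(1, -1, 1, 1)
        return low_pass + scale * high_pass


class FeatScale(nn.Module):
    """Re-weight token features on two frequency bands: the DC band
    (token-wise mean) and the remaining high-frequency residual."""

    def __init__(self, dim: int):
        super().__init__()
        # Per-channel re-weighting coefficients (assumed parameterization).
        self.lambda_dc = nn.Parameter(torch.zeros(dim))
        self.lambda_hc = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        dc = x.mean(dim=1, keepdim=True)   # DC component of each channel
        hc = x - dc                        # high-frequency residual
        return (1.0 + self.lambda_dc) * dc + (1.0 + self.lambda_hc) * hc


if __name__ == "__main__":
    attn = torch.softmax(torch.randn(2, 8, 197, 197), dim=-1)
    feats = torch.randn(2, 197, 384)
    print(AttnScale(num_heads=8)(attn).shape)  # torch.Size([2, 8, 197, 197])
    print(FeatScale(dim=384)(feats).shape)     # torch.Size([2, 197, 384])
```

In this reading, both modules reduce to the identity when their learnable coefficients are zero, which matches the abstract's claim that they add little parameter overhead and can be plugged into existing ViT variants.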
