Paper Title

Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers

Paper Authors

Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza Mardani, Mert Pilanci

Paper Abstract

Vision transformers using self-attention or its proposed alternatives have demonstrated promising results in many image-related tasks. However, the underpinning inductive bias of attention is not well understood. To address this issue, this paper analyzes attention through the lens of convex duality. For non-linear dot-product self-attention, as well as alternative mechanisms such as the MLP-mixer and the Fourier Neural Operator (FNO), we derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality. The convex programs lead to block nuclear-norm regularization, which promotes low rank in the latent feature and token dimensions. In particular, we show how self-attention networks implicitly cluster the tokens based on their latent similarity. We conduct experiments transferring a pre-trained transformer backbone to CIFAR-100 classification by fine-tuning a variety of convex attention heads. The results indicate the merits of the bias induced by attention compared with existing MLP or linear heads.
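To make the regularizer concrete: a block nuclear-norm penalty sums the nuclear norm (the sum of singular values) over blocks of the optimization variable, which encourages each block to be low rank. A minimal sketch of such an objective, where the convex loss $\mathcal{L}$, data $(X, Y)$, the block partition $\{Z_b\}$ of the trainable parameters, and the trade-off parameter $\lambda > 0$ are all introduced here for illustration and the paper's exact convex program may be parameterized differently, is

\[
\min_{\{Z_b\}} \; \mathcal{L}\big(f(X; \{Z_b\}),\, Y\big) \;+\; \lambda \sum_b \lVert Z_b \rVert_{*},
\]

where $\lVert \cdot \rVert_{*}$ denotes the nuclear norm. Applying the penalty block-wise over the latent-feature and token dimensions is what promotes low rank in both, matching the inductive bias the abstract attributes to attention.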
