Paper Title
Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem
Paper Authors
Paper Abstract
Recent research on the robustness of deep learning has shown that Vision Transformers (ViTs) surpass Convolutional Neural Networks (CNNs) under some perturbations, e.g., natural corruption, adversarial attacks, etc. Some papers argue that the superior robustness of ViTs comes from the segmentation of their input images; others say that Multi-head Self-Attention (MSA) is the key to preserving robustness. In this paper, we introduce a principled and unified theoretical framework to investigate these arguments about ViT robustness. We first theoretically prove that, unlike Transformers in Natural Language Processing, ViTs are Lipschitz continuous. Then we theoretically analyze the adversarial robustness of ViTs from the perspective of the Cauchy problem, via which we can quantify how robustness propagates through layers. We demonstrate that the first and last layers are the critical factors affecting the robustness of ViTs. Furthermore, based on our theory, we empirically show that, unlike the claims of existing research, MSA only contributes to the adversarial robustness of ViTs under weak adversarial attacks, e.g., FGSM; surprisingly, MSA actually compromises the model's adversarial robustness under stronger attacks, e.g., PGD attacks.
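The Cauchy-problem perspective mentioned in the abstract is the standard ODE view of residual architectures; the following is a minimal worked sketch of that idea, where the symbols x(t), f_t, and L_t are illustrative and not necessarily the paper's exact notation:

```latex
% Residual stream of a ViT viewed as a Cauchy (initial value) problem:
% x(0) is the embedded input, x(T) the final representation.
\frac{\mathrm{d}x(t)}{\mathrm{d}t} = f_t\bigl(x(t)\bigr), \qquad x(0) = x_0 .

% If each layer map f_t is L_t-Lipschitz, Gronwall's inequality bounds
% how an input perturbation \delta_0 grows as it propagates through layers:
\|\delta(t)\| \;\le\; \|\delta_0\| \, \exp\!\Bigl(\int_0^{t} L_s \,\mathrm{d}s\Bigr) .
```

Under this view, the Lipschitz constants of the first and last layers dominate the bound, which is consistent with the abstract's claim that those layers are the critical factors.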
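Since the abstract contrasts a weak attack (FGSM) with a stronger one (PGD), here is a minimal PyTorch sketch of these two standard attacks for reproducing such a comparison. This is an assumed setup, not the paper's evaluation code; `model`, `eps`, `alpha`, and `steps` are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    """Single-step FGSM: one step along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd_attack(model, x, y, eps, alpha=None, steps=10):
    """Multi-step PGD: iterated gradient-sign steps, each projected
    back into the L-infinity ball of radius eps around the input."""
    alpha = alpha if alpha is not None else eps / 4
    x = x.clone().detach()
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into eps-ball
        x_adv = x_adv.clamp(0, 1)                 # keep a valid image range
    return x_adv.detach()
```

In this setup, the abstract's finding would correspond to MSA-based models losing more accuracy under `pgd_attack` than MSA-free baselines, while holding up comparatively well under `fgsm_attack`.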