Paper Title
Unreasonable Effectiveness of Last Hidden Layer Activations for Adversarial Robustness
Paper Authors
Paper Abstract
In standard Deep Neural Network (DNN) based classifiers, the general convention is to omit the activation function in the last (output) layer and apply the softmax function directly to the logits to obtain the probability score of each class. In this type of architecture, the classifier's loss value for any output class is directly proportional to the difference between the final probability score and the label value of the associated class. Standard white-box adversarial evasion attacks, whether targeted or untargeted, mainly try to exploit the gradient of the model's loss function to craft adversarial samples and fool the model. In this study, we show both mathematically and experimentally that using some widely known activation functions in the output layer of the model with high temperature values has the effect of zeroing out the gradients for both targeted and untargeted attack cases, preventing attackers from exploiting the model's loss function to craft adversarial samples. We experimentally verified the efficacy of our approach on the MNIST (Digit) and CIFAR10 datasets. Detailed experiments confirm that our approach substantially improves robustness against gradient-based targeted and untargeted attacks. Moreover, we show that the increased non-linearity at the output layer provides additional benefits against some other attack methods, such as the DeepFool attack.
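To make the mechanism concrete, below is a minimal sketch, assuming PyTorch; the class name, layer sizes, and the scaled-tanh output `T * tanh(z)` are illustrative assumptions, one plausible instantiation of the high-temperature output activation the abstract describes, not necessarily the paper's exact configuration. With a large temperature `T`, the softmax inside the cross-entropy loss saturates toward a one-hot distribution, so for an input the model already classifies confidently, the loss gradient with respect to the input, which gradient-based attacks such as FGSM rely on, collapses toward zero.

```python
# Minimal sketch, assuming PyTorch. The class name, layer sizes, and the
# scaled-tanh output T * tanh(z) are illustrative assumptions; the paper's
# exact architecture and activation choice may differ.
import torch
import torch.nn as nn

class HighTempOutputClassifier(nn.Module):
    def __init__(self, in_dim=784, n_classes=10, temperature=1.0):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.head = nn.Linear(256, n_classes)
        self.temperature = temperature  # high T drives the softmax into saturation

    def forward(self, x):
        logits = self.head(self.backbone(x))
        # Bounded activation on the logits, scaled by a temperature, instead
        # of the usual "raw logits -> softmax" convention.
        return self.temperature * torch.tanh(logits)

ce = nn.CrossEntropyLoss()  # applies softmax to the model output internally

for T in (1.0, 100.0):
    torch.manual_seed(0)  # identical weights and input for both temperatures
    model = HighTempOutputClassifier(temperature=T)
    x = torch.rand(1, 784, requires_grad=True)
    y = model(x).argmax(dim=1)  # mimic a sample the model classifies "correctly"

    loss = ce(model(x), y)
    loss.backward()
    # With T = 100, softmax(T * tanh(z)) is close to one-hot, so the input
    # gradient that FGSM-style attacks take the sign of collapses.
    print(f"T={T:>5}: max |dL/dx| = {x.grad.abs().max().item():.3e}")
```

Running the sketch prints the maximum input gradient dropping by several orders of magnitude as the temperature grows, which illustrates the zeroed-gradient effect the abstract describes for gradient-based targeted and untargeted attacks.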