Paper Title
Scratching Visual Transformer's Back with Uniform Attention
Authors
Abstract
The favorable performance of Vision Transformers (ViTs) is often attributed to multi-head self-attention (MSA). MSA enables global interactions at every layer of a ViT model, in contrast to Convolutional Neural Networks (CNNs), which gradually increase the range of interaction across multiple layers. We study the role of the density of attention. Our preliminary analyses suggest that the spatial interactions of attention maps are close to dense interactions rather than sparse ones. This is a curious phenomenon, as dense attention maps are harder for the model to learn due to steeper softmax gradients around them. We interpret this as a strong preference of ViT models for including dense interactions. We thus manually insert uniform attention into each layer of ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting (CB). We observe that including CB reduces the degree of density in the original attention maps and increases both the capacity and generalizability of ViT models. CB incurs negligible cost: one line in your model code, no additional parameters, and minimal extra operations.
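
The abstract describes CB as a one-line operation that injects uniform attention, i.e. the mean over all tokens, into a ViT layer. Below is a minimal PyTorch sketch of that idea; the module name, the exact placement inside the transformer block, and the absence of any scaling factor are assumptions for illustration, not details taken from the abstract.

import torch
import torch.nn as nn

class ContextBroadcasting(nn.Module):
    # Sketch of the CB idea: add the uniform-attention output (the mean
    # over all tokens) back to every token. Where in the block this is
    # applied is an assumption here.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        # Averaging over the token axis is equivalent to attention with
        # uniform weights 1/N over all N tokens.
        return x + x.mean(dim=1, keepdim=True)

if __name__ == "__main__":
    tokens = torch.randn(2, 197, 768)   # e.g. ViT-B/16: CLS token + 196 patch tokens
    out = ContextBroadcasting()(tokens)
    print(out.shape)                    # torch.Size([2, 197, 768])

In an existing ViT implementation, the same effect can be obtained with the single line x = x + x.mean(dim=1, keepdim=True) inside the block: no parameters are added, and the extra cost is one mean and one addition per layer, consistent with the negligible overhead claimed in the abstract.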