Paper Title
Position Embedding Needs an Independent Layer Normalization
Paper Authors
Paper Abstract
The Position Embedding (PE) is critical for Vision Transformers (VTs) due to the permutation invariance of the self-attention operation. By analyzing the input and output of each encoder layer in VTs using reparameterization and visualization, we find that the default PE joining method (simply adding the PE and the patch embedding together) applies the same affine transformation to the token embedding and the PE, which limits the expressiveness of the PE and hence constrains the performance of VTs. To overcome this limitation, we propose a simple, effective, and robust method. Specifically, we provide two independent layer normalizations, one for the token embeddings and one for the PE, in each layer, and add their outputs together as the input of that layer's Multi-Head Self-Attention module. Since this method allows the model to adaptively adjust the PE information for different layers, we name it Layer-adaptive Position Embedding, abbreviated as LaPE. Extensive experiments demonstrate that LaPE improves various VTs with different types of PE and makes VTs robust to the PE type. For example, LaPE improves the accuracy of ViT-Lite by 0.94% on CIFAR-10, of CCT by 0.98% on CIFAR-100, and of DeiT by 1.72% on ImageNet-1K, which is remarkable considering the negligible extra parameters, memory, and computational cost that LaPE introduces. The code is publicly available at https://github.com/Ingrid725/LaPE.
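The abstract describes LaPE only at the architectural level. The following PyTorch-style sketch illustrates one way the idea could be realized, assuming a standard pre-norm encoder layer: the token embeddings and the position embedding each pass through their own LayerNorm before being summed as the attention input. The class name LaPEEncoderLayer, its hyperparameters, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' reference implementation (which is available at the linked repository).

```python
# Minimal sketch of the LaPE idea: each encoder layer normalizes the token
# embeddings and the position embedding with two independent LayerNorms and
# adds the results as the input to Multi-Head Self-Attention.
import torch
import torch.nn as nn


class LaPEEncoderLayer(nn.Module):  # illustrative name, not the official module
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm_tokens = nn.LayerNorm(dim)  # LayerNorm for token embeddings
        self.norm_pe = nn.LayerNorm(dim)      # independent LayerNorm for the PE
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor, pe: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token embeddings; pe: (1, N, dim) position embedding.
        # Normalizing tokens and PE separately lets each layer rescale the PE
        # independently of the tokens before they are mixed in attention.
        qkv = self.norm_tokens(x) + self.norm_pe(pe)
        x = x + self.attn(qkv, qkv, qkv, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


# Usage: stack such layers and feed the same (learnable) PE to every layer.
layer = LaPEEncoderLayer(dim=192, num_heads=3)
tokens = torch.randn(2, 197, 192)
pos_embed = nn.Parameter(torch.zeros(1, 197, 192))
out = layer(tokens, pos_embed)  # (2, 197, 192)
```

The key difference from a default ViT layer is that the PE is not folded into the token stream once at the input; under this sketch, every layer receives the PE again through its own LayerNorm, which is what allows layer-adaptive scaling of the positional information.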