Paper title
Group Generalized Mean Pooling for Vision Transformer
Authors
Abstract
Vision Transformer (ViT) extracts its final representation from either the class token or an average of all patch tokens, following the architecture of the Transformer in Natural Language Processing (NLP) or of Convolutional Neural Networks (CNNs) in computer vision. However, studies on the best way to aggregate the patch tokens are still limited to average pooling, even though widely used pooling strategies such as max and GeM pooling could also be considered. Despite their effectiveness, the existing pooling strategies consider neither the architecture of ViT nor the channel-wise differences in the activation maps, so they aggregate crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. Because ViT already groups the channels via its multi-head attention mechanism, grouping the channels with GGeM leads to lower head-wise dependence while amplifying important channels in the activation maps. GGeM yields performance boosts of 0.1%p to 0.7%p over the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating its superiority across a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are needed to implement it.
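To make the grouping step concrete, here is a minimal NumPy sketch of group-wise GeM pooling as the abstract describes it: channels are split into groups, and each group shares one pooling exponent. This is not the authors' implementation. The function name `ggem_pool`, the `eps` clamp (borrowed from common GeM implementations so that fractional powers stay defined), and the choice of consecutive channels as a group are all assumptions made for illustration. In the paper's setting the exponents would be learned parameters; here they are passed in as fixed values.

```python
import numpy as np


def ggem_pool(tokens, p, eps=1e-6):
    """Pool (N, C) patch tokens into a (C,) vector with one exponent per group.

    tokens: array of shape (N, C), N patch tokens with C channels.
    p:      array of shape (G,), one pooling exponent per channel group.
    eps:    hypothetical clamp value (an assumption, not from the abstract).
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    p = np.asarray(p, dtype=np.float64)
    n_tokens, n_channels = tokens.shape
    n_groups = p.shape[0]
    if n_channels % n_groups != 0:
        raise ValueError("number of channels must be divisible by number of groups")

    # Assumption: clamp activations to a small positive value before the power.
    x = np.clip(tokens, eps, None)

    # Assumption: consecutive channels form a group, so each group's exponent
    # is repeated across that group's channels.
    p_per_channel = np.repeat(p, n_channels // n_groups)  # shape (C,)

    # Generalized mean over the token axis, channel by channel.
    return np.mean(x ** p_per_channel, axis=0) ** (1.0 / p_per_channel)


# Two tokens, four channels, two groups: p=1 reduces to average pooling,
# while a larger p weights the pooled value toward the larger activations.
tokens = np.array([[1.0, 2.0, 3.0, 4.0],
                   [3.0, 2.0, 1.0, 4.0]])
pooled = ggem_pool(tokens, p=[1.0, 3.0])
```

With a single group this reduces to ordinary GeM pooling. An exponent of 1 gives average pooling, and a large exponent approaches max pooling.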