Paper Title

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Paper Authors

Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

Paper Abstract

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and memory storage. Each model needs an independent and complete finetuning process to adapt to different tasks, which limits its transferability to different visual domains. To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently. It possesses several benefits more appealing than prior arts. Firstly, AdaptFormer introduces lightweight modules that only add less than 2% extra parameters to a ViT, while it is able to increase the ViT's transferability without updating its original pre-trained parameters, significantly outperforming the existing 100% fully fine-tuned models on action recognition benchmarks. Secondly, it can be plug-and-play in different Transformers and scalable to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1.5% extra parameters, it achieves about 10% and 19% relative improvement compared to the fully fine-tuned models on Something-Something v2 and HMDB51, respectively. Code is available at https://github.com/ShoufaChen/AdaptFormer.
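
The abstract describes the core mechanism only at a high level: a small trainable bottleneck branch is added in parallel to the frozen MLP sub-block of each pre-trained ViT layer, so that adaptation updates only a tiny fraction (under 2%) of the parameters. The PyTorch-style snippet below is a minimal sketch of that idea for readers who want the gist before opening the repository; the class names, bottleneck width, scaling value, and the freeze_backbone helper are illustrative assumptions rather than the official AdaptFormer implementation (see the linked code for the actual design).

```python
# Minimal sketch of a parallel lightweight adapter for a frozen ViT MLP block.
# Illustrative only; names, widths, and the scale value are assumptions.
import torch
import torch.nn as nn


class ParallelAdapter(nn.Module):
    """Bottleneck branch: down-project -> ReLU -> up-project, scaled."""

    def __init__(self, dim: int, bottleneck_dim: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, dim)
        self.scale = scale
        # Zero-init the up-projection so the adapted block starts out
        # behaving exactly like the pre-trained one.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.up(self.act(self.down(x)))


class BlockWithAdapter(nn.Module):
    """Wraps a pre-trained (frozen) MLP sub-block with a trainable parallel adapter."""

    def __init__(self, mlp: nn.Module, norm: nn.Module, dim: int):
        super().__init__()
        self.mlp = mlp        # pre-trained MLP, kept frozen
        self.norm = norm      # pre-trained LayerNorm, kept frozen
        self.adapter = ParallelAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # Frozen branch + lightweight trainable branch + residual connection.
        return x + self.mlp(h) + self.adapter(h)


def freeze_backbone(model: nn.Module) -> None:
    """Train only the adapter parameters; keep all pre-trained ViT weights fixed."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

With this kind of wrapper, the same frozen ViT checkpoint can be shared across tasks, and only the small adapter weights (plus a task head) need to be stored and trained per task, which is the storage and transferability benefit the abstract emphasizes.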
