Paper Title

Towards a Unified View on Visual Parameter-Efficient Transfer Learning

Paper Authors

Yu, Bruce X. B., Chang, Jianlong, Liu, Lingbo, Tian, Qi, Chen, Chang Wen

Paper Abstract

Parameter-efficient transfer learning (PETL) aims at making good use of the representation knowledge in pre-trained large models by fine-tuning a small number of parameters. Recently, taking inspiration from the natural language processing (NLP) domain, popular PETL techniques such as prompt-tuning and Adapter have also been successfully applied to the vision domain. However, prefix-tuning remains under-explored for vision tasks. In this work, we intend to adapt large vision models (LVMs) to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view of PETL called visual-PETL (V-PETL) to investigate the effects of different PETL techniques, data scales of downstream domains, positions of trainable parameters, and other aspects affecting the trade-off. Specifically, we analyze the positional importance of trainable parameters and differences between NLP and vision tasks in terms of data structures and pre-training mechanisms while implementing various PETL techniques, especially for the under-explored prefix-tuning technique. Based on a comprehensive understanding of the differences between NLP and vision data, we propose a new variation of the prefix-tuning module called parallel attention (PATT) for vision downstream tasks. An extensive empirical analysis on vision tasks via different frozen LVMs has been carried out, and the findings show that the proposed PATT can effectively contribute to other PETL techniques. An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters and outperforms full-tuning with far fewer parameters. Code and data are available at: https://github.com/bruceyo/V-PETL.
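The abstract contrasts prefix-tuning, which injects trainable tokens into a frozen attention layer, with the proposed parallel attention (PATT) module, which attaches a trainable branch alongside the frozen block. The exact formulation of PATT is not given in the abstract, so the sketch below is a hypothetical PyTorch illustration of the two ideas under stated assumptions; all names and hyperparameters (PrefixTuning, ParallelAttentionBranch, bottleneck, scale) are illustrative, not the authors' implementation. See the linked repository for the official code.

```python
# Hypothetical sketch, not the paper's code: prefix-tuning prepends trainable
# key/value prefixes inside a frozen attention layer, while a "parallel
# attention" variant runs a small trainable attention branch alongside the
# frozen block and adds its output back.
import torch
import torch.nn as nn


class PrefixTuning(nn.Module):
    """Trainable key/value prefixes prepended to a frozen attention layer."""

    def __init__(self, num_prefix: int, dim: int):
        super().__init__()
        self.prefix_k = nn.Parameter(torch.randn(num_prefix, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(num_prefix, dim) * 0.02)

    def extend(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (batch, seq_len, dim) -> (batch, num_prefix + seq_len, dim)
        b = k.size(0)
        pk = self.prefix_k.unsqueeze(0).expand(b, -1, -1)
        pv = self.prefix_v.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([pk, k], dim=1), torch.cat([pv, v], dim=1)


class ParallelAttentionBranch(nn.Module):
    """Light trainable attention branch added in parallel to a frozen block."""

    def __init__(self, dim: int, bottleneck: int = 8, num_heads: int = 1):
        super().__init__()
        # Down-project, attend in the small bottleneck space, then up-project,
        # so the branch adds very few trainable parameters.
        self.down = nn.Linear(dim, bottleneck)
        self.attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.down(x)
        h, _ = self.attn(h, h, h)
        return self.up(h)


def adapt(frozen_attn_out: torch.Tensor, branch: ParallelAttentionBranch,
          x: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    # Frozen backbone output plus the scaled trainable parallel branch.
    return frozen_attn_out + scale * branch(x)


# Example: adapt the output of one frozen attention block.
x = torch.randn(2, 196, 768)        # (batch, tokens, dim), e.g. ViT/Swin features
branch = ParallelAttentionBranch(768)
frozen_out = x                       # stand-in for a frozen block's output
y = adapt(frozen_out, branch, x)     # only `branch` parameters would be trained
```

The point the sketch highlights is the parameter-accuracy trade-off the abstract discusses: the LVM backbone stays frozen, and only the small prefix or branch parameters are updated during downstream fine-tuning.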
