Paper Title

Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models

Authors

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen, Yang Liu, Jie Tang, Juanzi Li, Maosong Sun

Abstract

Despite the success, the process of fine-tuning large-scale PLMs brings prohibitive adaptation costs. In fact, fine-tuning all the parameters of a colossal model and retaining separate instances for different tasks are practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, dubbed delta tuning in this paper. In contrast with standard fine-tuning, delta tuning only fine-tunes a small portion of the model parameters while keeping the rest untouched, largely reducing both the computation and storage costs. Recent studies have demonstrated that a series of delta tuning methods with distinct tuned parameter selections could achieve performance on a par with full-parameter fine-tuning, suggesting a new promising way of stimulating large-scale PLMs. In this paper, we first formally describe the problem of delta tuning and then comprehensively review recent delta tuning approaches. We also propose a unified categorization criterion that divides existing delta tuning methods into three groups: addition-based, specification-based, and reparameterization-based methods. Though initially proposed as an efficient method to steer large models, we believe that some of the fascinating evidence discovered along with delta tuning could help further reveal the mechanisms of PLMs and even deep neural networks. To this end, we discuss the theoretical principles underlying the effectiveness of delta tuning and propose frameworks to interpret it from the perspectives of optimization and optimal control, respectively. Furthermore, we provide a holistic empirical study of representative methods, where results on over 100 NLP tasks demonstrate a comprehensive performance comparison of different approaches. The experimental results also cover the analysis of the combinatorial, scaling, and transferability properties of delta tuning.
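To make the core idea concrete, here is a minimal PyTorch sketch (not code from the paper): the pretrained weights are frozen and only a small low-rank "delta" module, in the spirit of reparameterization-based methods such as LoRA, receives gradient updates. The class name LowRankDelta, the toy nn.Linear standing in for a PLM sub-module, and the chosen rank are illustrative assumptions, not artifacts of the paper.

```python
# A minimal sketch of the delta tuning idea: freeze the "pretrained" weights
# and train only a small low-rank delta (LoRA-style reparameterization).
import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    """Wraps a frozen linear layer and adds a trainable low-rank update B @ A."""
    def __init__(self, frozen_linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():      # keep the pretrained weights untouched
            p.requires_grad = False
        d_out, d_in = frozen_linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable delta params
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # zero-init: delta starts at 0

    def forward(self, x):
        return self.frozen(x) + x @ self.A.T @ self.B.T        # frozen path + low-rank delta

# Toy "pretrained" layer standing in for a PLM sub-module.
pretrained = nn.Linear(768, 768)
model = LowRankDelta(pretrained, rank=8)

trainable = [p for p in model.parameters() if p.requires_grad]
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {sum(p.numel() for p in trainable)} / {total}")

# Only the delta parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
x, y = torch.randn(16, 768), torch.randn(16, 768)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```

Addition-based methods (e.g., adapters) and specification-based methods (e.g., tuning only bias terms) follow the same recipe: freeze the backbone and pass only the small trainable subset to the optimizer.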
