Paper Title

Could Giant Pretrained Image Models Extract Universal Representations?

Paper Authors

Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, Yue Cao

Paper Abstract

Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
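
The abstract's core setup is a frozen pretrained backbone shared across tasks, with only lightweight task-specific components being trained. The sketch below illustrates this idea in PyTorch under stated assumptions: it uses torchvision's `resnet50` with ImageNet weights as a stand-in backbone (the paper uses SwinV2 models, not shown here) and a plain linear classifier as the task head; the paper's actual detection, segmentation, and action-recognition heads are more elaborate.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Load a pretrained backbone (an illustrative stand-in for the paper's SwinV2).
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()  # expose the 2048-d pooled features

# Freeze every backbone parameter so only the task head is trained.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# A lightweight task-specific head (here: a simple linear classifier).
num_classes = 400  # hypothetical label space, e.g. Kinetics-400-sized
head = nn.Linear(2048, num_classes)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# One illustrative training step on dummy data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():            # frozen backbone: no gradients needed
    features = backbone(images)  # shape: (8, 2048)
logits = head(features)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

Because the backbone is shared and never updated, the same frozen features can in principle serve multiple downstream heads at once, which is the property the paper studies at scale.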
