Paper Title
SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models
Paper Authors
Paper Abstract
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image-text pairs, and have been shown to sometimes exhibit impressive zero- and low-shot image classification performance. However, due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required. To combat this, a series of lightweight adaptation methods have been proposed to efficiently adapt such models when limited supervision is available. In this work, we show that while effective on internet-style datasets, even these remedies under-deliver on classification tasks whose images differ significantly from those commonly found online. To address this issue, we present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning. On a set of challenging visual classification tasks, we report an average classification accuracy improvement of 10% in the low-shot setting compared to existing methods. Further, we present a fully automatic way of selecting an important blending hyperparameter for our model that does not require any held-out labeled validation data. Code for our project is available here: https://github.com/omipan/svl_adapter.
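To make the blending idea concrete, the minimal Python/PyTorch sketch below mixes CLIP's zero-shot class probabilities with those of a classifier trained on self-supervised features, controlled by a single blending weight. The function names, the confidence-based heuristic for choosing the weight, and the overall structure are illustrative assumptions, not the authors' exact implementation; the repository at https://github.com/omipan/svl_adapter is the authoritative reference.

import torch
import torch.nn.functional as F

def blend_predictions(clip_logits: torch.Tensor, ssl_logits: torch.Tensor, alpha: float) -> torch.Tensor:
    # Convert both sets of logits to class probabilities and mix them with a
    # single blending weight alpha in [0, 1]: alpha near 1 trusts CLIP's
    # zero-shot predictions, alpha near 0 trusts the classifier trained on
    # self-supervised features.
    clip_probs = F.softmax(clip_logits, dim=-1)
    ssl_probs = F.softmax(ssl_logits, dim=-1)
    return alpha * clip_probs + (1.0 - alpha) * ssl_probs

def estimate_alpha(clip_logits: torch.Tensor) -> float:
    # Hypothetical label-free heuristic: use CLIP's average confidence
    # (mean maximum softmax probability over unlabeled target images) as the
    # blending weight, so no held-out labeled validation data is needed.
    probs = F.softmax(clip_logits, dim=-1)
    return probs.max(dim=-1).values.mean().item()

In use, clip_logits would come from CLIP's image-text similarities for the target classes and ssl_logits from the adapter trained on self-supervised features; one would call estimate_alpha(clip_logits) once on the unlabeled target images and then score classes with blend_predictions.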