Paper Title
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
Paper Authors
Paper Abstract
Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization in many downstream tasks with properly designed text prompts. Instead of relying on hand-engineered prompts, recent works learn prompts using the training data from downstream tasks. While effective, training on domain-specific data reduces a model's generalization capability to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that can learn adaptive prompts on the fly with a single test sample. For image classification, TPT optimizes the prompt by minimizing the entropy with confidence selection so that the model has consistent predictions across different augmented views of each test sample. In evaluating generalization to natural distribution shifts, TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/TPT.
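To make the procedure described in the abstract concrete, below is a minimal sketch of one TPT step: augment the single test image, keep only the most confident views, and minimize the entropy of their averaged prediction by updating the prompt alone. The `model(images, prompt_ctx)` and `augment` interfaces, the view count, the selection ratio, and the optimizer settings are illustrative assumptions for this sketch, not the authors' reference implementation (see the project page above for that).

```python
# Illustrative sketch of one test-time prompt tuning (TPT) step.
# Assumes a CLIP-like wrapper `model(images, prompt_ctx) -> class logits`
# and a random augmentation callable `augment(image) -> image`; both are
# hypothetical stand-ins, as are the hyperparameter values.
import math
import torch

def avg_entropy(logits):
    # Entropy of the marginal class distribution averaged over views.
    log_probs = logits.log_softmax(dim=-1)                    # (n_views, n_classes)
    avg_log_probs = torch.logsumexp(log_probs, dim=0) - math.log(logits.size(0))
    return -(avg_log_probs.exp() * avg_log_probs).sum()

def tpt_step(model, prompt_ctx, image, augment,
             n_views=64, keep_ratio=0.1, lr=5e-3):
    # `prompt_ctx` is the learnable prompt tensor (requires_grad=True);
    # the model weights themselves stay frozen.
    optimizer = torch.optim.AdamW([prompt_ctx], lr=lr)

    # 1) Build a batch of random augmented views of the single test image.
    views = torch.stack([augment(image) for _ in range(n_views)])
    logits = model(views, prompt_ctx)                         # (n_views, n_classes)

    # 2) Confidence selection: keep the views whose individual predictive
    #    entropy is lowest, i.e., the most confident views.
    probs = logits.softmax(dim=-1)
    view_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    n_keep = max(1, int(n_views * keep_ratio))
    keep = view_entropy.argsort()[:n_keep]

    # 3) Minimize the entropy of the averaged prediction over the selected
    #    views, updating only the prompt context.
    loss = avg_entropy(logits[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 4) Classify the original (unaugmented) image with the tuned prompt.
    with torch.no_grad():
        return model(image.unsqueeze(0), prompt_ctx).argmax(dim=-1)
```

Because the loss is the entropy of the prediction averaged over views, a single gradient step pushes the prompt toward a context under which the different augmentations agree, without needing labels or any task-specific training data.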