Paper Title
ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations
Paper Authors
Paper Abstract
State-of-the-art empirical work has shown that visual representations learned by deep neural networks are inherently robust and capable of performing classification tasks on diverse datasets. For example, CLIP demonstrated zero-shot transfer performance on multiple datasets for classification tasks in a joint embedding space of image and text pairs. However, it showed negative transfer performance on standard datasets such as Birdsnap, RESISC45, and MNIST. In this paper, we propose ContextCLIP, a contextual and contrastive learning framework for the contextual alignment of image-text pairs, which learns robust visual representations on the Conceptual Captions dataset. We observe that our framework improves image-text alignment by contextually aligning text and image representations in the joint embedding space. ContextCLIP shows good qualitative performance on text-to-image retrieval tasks and improved classification accuracy. We evaluate the model quantitatively with zero-shot transfer and fine-tuning experiments on the CIFAR-10, CIFAR-100, Birdsnap, RESISC45, and MNIST datasets for the classification task.
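The abstract does not spell out ContextCLIP's exact contextual objective, but it builds on CLIP-style contrastive alignment of image-text pairs in a joint embedding space. The sketch below is a minimal, assumed illustration of that baseline: a symmetric contrastive (InfoNCE) loss over paired image and text embeddings. The function name and temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    image and text embeddings in a joint embedding space.
    Both inputs have shape (batch, dim); matching pairs share an index."""
    # L2-normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Ground-truth matches lie on the diagonal of the similarity matrix.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

At inference time, zero-shot classification with such a model is done by embedding each class name as a text prompt and assigning an image to the class whose text embedding is most similar to the image embedding.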