Paper Title

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

Paper Authors

Sophia Gu, Christopher Clark, Aniruddha Kembhavi

Paper Abstract

Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether it is possible to learn those skills from text data and then transfer them to vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study strategies to mitigate this concern. We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering and visual news captioning, and evaluate them on standard benchmarks using images. We find these models perform close to models trained on images, while surpassing prior work for captioning and visual entailment in this text-only setting by over 9 points, and outperforming all prior work on visual news by over 30 points. We also showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data, but instead using readily-available text data from books, the web, or language models.
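The abstract's key idea is to train a generation model on CLIP text embeddings alone and then feed it CLIP image embeddings at test time. The sketch below illustrates that recipe, assuming OpenAI's `clip` package; the `CaptionDecoder` stub, the noise scale, and the file `photo.jpg` are illustrative placeholders, and the Gaussian-noise injection is one plausible instance of the gap-mitigation strategies the abstract mentions, not necessarily the paper's exact method.

```python
# Sketch of text-only training with cross-modal transfer at inference.
# Assumes OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP.git).
# CaptionDecoder, the noise scale, and photo.jpg are placeholders.
import torch
import torch.nn as nn
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class CaptionDecoder(nn.Module):
    """Placeholder decoder mapping a CLIP embedding to token logits."""
    def __init__(self, embed_dim=512, vocab_size=49408):
        super().__init__()
        self.proj = nn.Linear(embed_dim, vocab_size)

    def forward(self, embedding):
        return self.proj(embedding)

decoder = CaptionDecoder().to(device)

# Training: text only. Encode the caption, perturb the embedding with
# Gaussian noise (one way to bridge the text/image modality gap), and
# train the decoder to reconstruct the caption (training loop omitted).
caption = clip.tokenize(["a dog catching a frisbee"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(caption).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
noisy_emb = text_emb + 0.08 * torch.randn_like(text_emb)
logits = decoder(noisy_emb)  # supervise against the caption tokens

# Inference: swap in the image embedding from the shared joint space;
# the decoder has never seen an image during training.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_emb = model.encode_image(image).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
logits = decoder(img_emb)  # decode a caption from the image embedding
```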
