Paper Title
Diverse Image Captioning with Context-Object Split Latent Spaces
Paper Authors
Paper Abstract
Diverse image captioning models aim to learn the one-to-many mappings that are innate to cross-domain datasets, such as images and text. Current methods for this task are based on generative latent variable models, e.g., VAEs with structured latent spaces. Yet, the amount of multimodality captured by prior work is limited to that of the paired training data; the true diversity of the underlying generative process is not fully captured. To address this limitation, we leverage the contextual descriptions in the dataset that explain similar contexts in different visual scenes. To this end, we introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts within the dataset. Our framework not only enables diverse captioning through context-based pseudo-supervision, but also extends to images with novel objects that have no paired captions in the training data. We evaluate our COS-CVAE approach on the standard COCO dataset and on the held-out COCO dataset consisting of images with novel objects, showing significant gains in accuracy and diversity.
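To illustrate the core idea of a context-object split latent space, below is a minimal PyTorch sketch of a VAE-style encoder whose latent code is factorized into a context part z_c and an object part z_o. All names (COSLatentEncoder, ctx_head, obj_head), feature inputs, and dimensions are hypothetical placeholders for illustration; this is not the paper's COS-CVAE architecture, only a toy instance of splitting the latent space into two separately parameterized Gaussian factors.

```python
import torch
import torch.nn as nn

class COSLatentEncoder(nn.Module):
    """Toy encoder with a context-object split latent space.

    The latent code is the concatenation of two independent Gaussian
    factors: z_c (context) and z_o (objects). Architecture details are
    illustrative assumptions, not the COS-CVAE model from the paper.
    """

    def __init__(self, feat_dim=512, z_ctx_dim=64, z_obj_dim=64):
        super().__init__()
        # Separate heads predict (mu, logvar) for each latent factor.
        self.ctx_head = nn.Linear(feat_dim, 2 * z_ctx_dim)
        self.obj_head = nn.Linear(feat_dim, 2 * z_obj_dim)

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, ctx_feat, obj_feat):
        # Context and object features each parameterize their own factor.
        mu_c, logvar_c = self.ctx_head(ctx_feat).chunk(2, dim=-1)
        mu_o, logvar_o = self.obj_head(obj_feat).chunk(2, dim=-1)
        z_c = self.reparameterize(mu_c, logvar_c)
        z_o = self.reparameterize(mu_o, logvar_o)
        # A caption decoder (omitted) would condition on [z_c, z_o],
        # keeping context and object information as separable factors.
        return torch.cat([z_c, z_o], dim=-1)

if __name__ == "__main__":
    enc = COSLatentEncoder()
    ctx = torch.randn(4, 512)  # e.g., pooled scene/context features
    obj = torch.randn(4, 512)  # e.g., pooled object-region features
    z = enc(ctx, obj)
    print(z.shape)  # torch.Size([4, 128])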