Paper Title

CapOnImage: Context-driven Dense-Captioning on Image

Authors

Yiqi Gao, Xinglin Hou, Yuanmeng Zhang, Tiezheng Ge, Yuning Jiang, Peng Wang

Abstract

Existing image captioning systems are dedicated to generating narrative captions for images, which are spatially detached from the image in presentation. However, text can also be used as decoration on the image to highlight key points and increase the attractiveness of the image. In this work, we introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information. To fully exploit the surrounding visual context and generate the most suitable caption for each location, we propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations from easy to difficult. Since the model may generate redundant captions for nearby locations, we further enhance the location embedding with neighbor locations as context. For this new task, we also introduce a large-scale benchmark called CapOnImage2M, which contains 2.1 million product images, each with an average of 4.8 spatially localized captions. Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity. We will make the code and datasets public to facilitate future research.

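The abstract does not detail how the location embedding is enhanced with neighbor locations; below is a minimal sketch, assuming a PyTorch setup, of one plausible realization in which each candidate text location attends over the other locations to gather context. The class name, the bounding-box parameterization, and the attention-based aggregation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of enriching a location embedding
# with neighboring locations as context. Shapes and names are illustrative.
import torch
import torch.nn as nn

class NeighborAwareLocationEmbedding(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Project a normalized bounding box (x1, y1, x2, y2) to an embedding.
        self.box_proj = nn.Linear(4, dim)
        # Let each location attend over the other candidate locations.
        self.neighbor_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, boxes):
        # boxes: (batch, num_locations, 4), coordinates normalized to [0, 1]
        loc = self.box_proj(boxes)                  # (B, N, dim) per-location embedding
        ctx, _ = self.neighbor_attn(loc, loc, loc)  # context gathered from neighbor locations
        return loc + ctx                            # location embedding enriched with context

# Example: three candidate text placements on one image
emb = NeighborAwareLocationEmbedding()
boxes = torch.rand(1, 3, 4)
print(emb(boxes).shape)  # torch.Size([1, 3, 256])
```

Letting locations see each other in this way is one straightforward mechanism for discouraging redundant captions at nearby positions, which is the motivation stated in the abstract.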