Paper Title
Dense Relational Image Captioning via Multi-task Triple-Stream Networks
Paper Authors
Abstract
We introduce dense relational captioning, a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in a visual scene. Relational captioning provides explicit descriptions for each relationship between object combinations. This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding based on relationships, e.g., relational proposal generation. For relational understanding between objects, the part-of-speech (POS; i.e., subject-object-predicate categories) can be valuable prior information to guide the causal sequence of words in a caption. We train our framework not only to generate captions but also to understand the POS of each word. To this end, we propose the multi-task triple-stream network (MTTSNet), which consists of three recurrent units, each responsible for one POS category, trained by jointly predicting the correct caption and the POS of each word. In addition, we find that the performance of MTTSNet can be improved by modulating the object embeddings with an explicit relational module. We demonstrate that our proposed model can generate more diverse and richer captions, via extensive experimental analysis on large-scale datasets and several metrics. We then present applications of our framework to holistic image captioning, scene graph generation, and retrieval tasks.
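The abstract describes three recurrent streams, one per POS role (subject, predicate, object), whose states are fused to jointly predict each caption word and its POS tag. The following is a minimal, self-contained sketch of that multi-task triple-stream idea only; the cell update, layer sizes, and class/function names (`TripleStreamSketch`, `gru_step`, etc.) are illustrative assumptions, not the paper's actual MTTSNet implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
HID, VOCAB, N_POS = 8, 20, 3  # hypothetical hidden size, vocab size, POS classes


def gru_step(x, h, W):
    # Simplified recurrent update standing in for one stream's RNN/LSTM cell.
    return np.tanh(W @ np.concatenate([x, h]))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class TripleStreamSketch:
    """Toy triple-stream decoder: three recurrent units (subject / predicate /
    object) share the input; their hidden states are concatenated and fed to
    two task heads that jointly predict the next word and its POS tag."""

    def __init__(self):
        self.W = {k: rng.normal(scale=0.1, size=(HID, 2 * HID))
                  for k in ("subj", "pred", "obj")}
        self.h = {k: np.zeros(HID) for k in self.W}          # per-stream state
        self.word_head = rng.normal(scale=0.1, size=(VOCAB, 3 * HID))
        self.pos_head = rng.normal(scale=0.1, size=(N_POS, 3 * HID))

    def step(self, x):
        for k in self.h:                                     # advance each stream
            self.h[k] = gru_step(x, self.h[k], self.W[k])
        fused = np.concatenate([self.h["subj"], self.h["pred"], self.h["obj"]])
        # Multi-task output: word distribution and POS distribution.
        return softmax(self.word_head @ fused), softmax(self.pos_head @ fused)


net = TripleStreamSketch()
word_p, pos_p = net.step(rng.normal(size=HID))  # one decoding step
```

In training, both heads would receive supervision (cross-entropy on the ground-truth word and on its POS label), which is what makes the setup multi-task; the paper additionally conditions the streams on relational object embeddings, which this sketch omits.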