Paper Title

Iterative Scene Graph Generation with Generative Transformers

Authors

Sanjoy Kundu, Sathyanarayanan N. Aakur

Abstract

Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection, to name a few. Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene, which adds computational overhead to the approach. This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction. Using two transformer-based components, we first sample a possible scene graph structure from detected objects and their visual features. We then perform predicate classification on the sampled edges to generate the final scene graph. This approach allows us to efficiently generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while offering competitive performance to unbiased SGG approaches.
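The efficiency claim in the abstract comes from sampling a sparse graph structure first and running predicate classification only on the sampled edges, instead of labeling all O(n²) object pairs. A rough sketch of that two-stage flow is below; the edge scorer, predicate classifier, and all function names are illustrative placeholders, not the paper's actual transformer components:

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract:
# (1) sample a sparse structure over detected objects, (2) classify a
# predicate only for the sampled edges. The scorer and classifier here are
# stand-in callables, not the paper's generative transformer models.

from itertools import permutations

def sample_structure(features, edge_scorer, k):
    """Stage 1: keep only the k highest-scoring directed edges."""
    candidates = [(i, j, edge_scorer(features[i], features[j]))
                  for i, j in permutations(range(len(features)), 2)]
    candidates.sort(key=lambda t: t[2], reverse=True)
    return [(i, j) for i, j, _ in candidates[:k]]

def classify_predicates(features, edges, predicate_clf):
    """Stage 2: label each sampled edge as a (subject, predicate, object) triple."""
    return [(i, predicate_clf(features[i], features[j]), j) for i, j in edges]

def generate_scene_graph(features, edge_scorer, predicate_clf, k=3):
    edges = sample_structure(features, edge_scorer, k)
    return classify_predicates(features, edges, predicate_clf)
```

With n detected objects, stage 2 runs on k edges rather than n(n-1) pairs, which is where the reduced inference overhead would come from.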
