Paper Title
Transforming Image Generation from Scene Graphs
Paper Authors
Paper Abstract
Generating images from semantic visual knowledge is a challenging task that can be useful for conditioning the synthesis process in complex, subtle, and unambiguous ways, compared to alternatives such as class labels or text descriptions. Although generative methods conditioned on semantic representations exist, they do not provide a way to control the generation process beyond the specification of constraints between objects. For example, the ability to iteratively generate or modify images by manually adding specific items is a desirable property that, to our knowledge, has not been fully investigated in the literature. In this work we propose a transformer-based approach conditioned on scene graphs that, unlike recent transformer-based methods, also employs a decoder to autoregressively compose images, making the synthesis process more effective and controllable. The proposed architecture is composed of three modules: 1) a graph convolutional network, which encodes the relationships of the input graph; 2) an encoder-decoder transformer, which autoregressively composes the output image; 3) an auto-encoder, which generates the representations used as input and output of each transformer generation step. Results obtained on CIFAR10 and MNIST images show that our model is able to satisfy the semantic constraints defined by a scene graph and to model relations between visual objects in the scene, while taking into account a user-provided partial rendering of the desired target.
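To make the three-module pipeline concrete, below is a minimal PyTorch sketch of how the pieces described in the abstract could be wired together: a graph convolution encodes the scene-graph nodes, an encoder-decoder transformer attends to those graph tokens and autoregressively emits latents, and an auto-encoder maps between image patches and the latents exchanged with the transformer. All class names, dimensions, and wiring choices here are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: module names, shapes, and hyperparameters
# are assumptions based on the abstract, not the paper's actual code.
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """One graph-convolution layer: mean-aggregate neighbor features via
    the adjacency matrix, then apply a linear projection."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, in_dim), adj: (num_nodes, num_nodes)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        agg = adj @ node_feats / deg  # mean over neighbors
        return torch.relu(self.proj(agg))


class PatchAutoEncoder(nn.Module):
    """Auto-encoder mapping image patches to latent tokens and back;
    the transformer operates on these latents at each step."""
    def __init__(self, patch_dim, latent_dim):
        super().__init__()
        self.enc = nn.Linear(patch_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, patch_dim)

    def forward(self, patches):
        z = self.enc(patches)
        return self.dec(z), z


class SceneGraphToImage(nn.Module):
    """Encoder-decoder transformer: the encoder attends to graph-node
    embeddings, the decoder autoregressively emits patch latents."""
    def __init__(self, node_dim=64, latent_dim=64, patch_dim=48):
        super().__init__()
        self.gcn = GraphConv(node_dim, latent_dim)
        self.ae = PatchAutoEncoder(patch_dim, latent_dim)
        self.transformer = nn.Transformer(
            d_model=latent_dim, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.start = nn.Parameter(torch.zeros(1, 1, latent_dim))

    def forward(self, node_feats, adj, prev_patches):
        # Condition on the scene graph; predict the next patch given the
        # (possibly user-provided, partial) rendering generated so far.
        graph_tokens = self.gcn(node_feats, adj).unsqueeze(0)
        _, prev_latents = self.ae(prev_patches)
        tgt = torch.cat([self.start, prev_latents.unsqueeze(0)], dim=1)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(graph_tokens, tgt, tgt_mask=mask)
        next_latent = out[:, -1]         # latent of the next patch
        return self.ae.dec(next_latent)  # decode back to pixel space


# Usage sketch: 3 graph nodes, 2 patches already rendered.
model = SceneGraphToImage()
nodes, adj = torch.randn(3, 64), torch.ones(3, 3)
partial = torch.randn(2, 48)  # e.g. a user-provided partial rendering
next_patch = model(nodes, adj, partial)  # (1, 48)
```

At inference time, the generated patch would be appended to `prev_patches` and the loop repeated until the image is complete; seeding `prev_patches` with a user-provided partial rendering is what enables the iterative, controllable synthesis the abstract describes.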