Paper Title

Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors

Authors

Cathrin Elich, Martin R. Oswald, Marc Pollefeys, Joerg Stueckler

Abstract

Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose PriSMONet, a novel approach based on Prior Shape knowledge for learning Multi-Object 3D scene decomposition and representations from single images. Our approach learns to decompose images of synthetic scenes with multiple objects on a planar surface into their constituent scene objects and to infer their 3D properties from a single view. A recurrent encoder regresses a latent representation of the 3D shape, pose, and texture of each object from an input RGB image. Through differentiable rendering, we train our model to decompose scenes from RGB-D images in a self-supervised way. The 3D shapes are represented continuously in function space as signed distance functions, which we pre-train from example shapes in a supervised way. These shape priors provide weak supervision signals to better condition the challenging overall learning task. We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out the benefits of the learned representation.
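To illustrate the signed-distance-function (SDF) representation the abstract refers to, here is a minimal sketch (not the authors' code): each object is an SDF that is negative inside the shape and positive outside, and a multi-object scene can be composed as the minimum over the per-object SDFs. PriSMONet learns a neural SDF decoder from latent shape codes; analytic primitives stand in for it below, and all function names are hypothetical.

```python
import numpy as np

def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere: negative inside, positive outside."""
    return np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(center, dtype=float)) - radius

def box_sdf(p, center, half_extents):
    """Signed distance to an axis-aligned box (standard analytic formula)."""
    q = np.abs(np.asarray(p, dtype=float) - np.asarray(center, dtype=float)) - np.asarray(half_extents, dtype=float)
    outside = np.linalg.norm(np.maximum(q, 0.0))
    inside = min(max(q[0], q[1], q[2]), 0.0)
    return outside + inside

def scene_sdf(p, object_sdfs):
    """Scene as the union of objects: the minimum over per-object SDFs."""
    return min(f(p) for f in object_sdfs)

# A toy two-object scene on a planar surface: a sphere and a box.
objects = [
    lambda p: sphere_sdf(p, center=(0.0, 0.0, 0.5), radius=0.5),
    lambda p: box_sdf(p, center=(1.5, 0.0, 0.3), half_extents=(0.3, 0.3, 0.3)),
]

print(scene_sdf((0.0, 0.0, 0.5), objects))  # inside the sphere, so negative
print(scene_sdf((5.0, 0.0, 0.0), objects))  # far from both objects, so positive
```

In the paper's setting, the analytic primitives would be replaced by a learned decoder mapping each object's latent shape code and query point to a signed distance, which is what makes the representation continuous in function space and compatible with differentiable rendering.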
