Paper title
Learning to Predict the 3D Layout of a Scene
Paper authors
Paper abstract
While 2D object detection has improved significantly in recent years, real-world applications of computer vision often require an understanding of the 3D layout of a scene. Many recent approaches to 3D detection use LiDAR point clouds for prediction. We propose a method that uses only a single RGB image, thus enabling applications in devices or vehicles that do not have LiDAR sensors. By using an RGB image, we can leverage the maturity and success of recent 2D object detectors by extending a 2D detector with a 3D detection head. In this paper we discuss different approaches and experiments for designing this 3D detection head, including both regression and classification methods. Furthermore, we evaluate how subproblems and implementation details impact the overall prediction result. We use the KITTI dataset for training, which consists of street traffic scenes with class labels, 2D bounding boxes, and 3D annotations with seven degrees of freedom. Our final architecture is based on Faster R-CNN. The outputs of the convolutional backbone are fixed-size feature maps for every region of interest. Fully connected layers within the network head then propose an object class and perform 2D bounding box regression. We extend the network head with a 3D detection head, which predicts every degree of freedom of a 3D bounding box via classification. We achieve a mean average precision of 47.3% for moderately difficult data, measured at a 3D intersection-over-union threshold of 70%, as required by the official KITTI benchmark, outperforming previous state-of-the-art methods that use only a single RGB image by a large margin.
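The abstract states that each degree of freedom of the 3D bounding box is predicted via classification rather than regression. The sketch below shows one common way such a scheme can work: a continuous quantity is discretized into bins, the network outputs one logit per bin, and the predicted value is the center of the most probable bin. It is only an illustration under stated assumptions, not the authors' implementation. The function names, the number of bins, and the value range are all hypothetical, since the abstract does not specify them.

```python
import math


def encode_to_bin(value, low, high, num_bins):
    """Map a continuous value in [low, high) to a class index (training target)."""
    if not low < high or num_bins < 1:
        raise ValueError("need low < high and num_bins >= 1")
    width = (high - low) / num_bins
    index = int((value - low) // width)
    # Clamp so that values on or outside the range boundary stay valid classes.
    return min(max(index, 0), num_bins - 1)


def decode_from_bin(index, low, high, num_bins):
    """Map a class index back to a continuous value (the bin center)."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_value(logits, low, high):
    """Pick the most probable bin and return its center."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return decode_from_bin(best, low, high, len(logits))


if __name__ == "__main__":
    # Hypothetical setup: yaw angle in [-pi, pi) split into 8 bins.
    low, high, num_bins = -math.pi, math.pi, 8
    target = encode_to_bin(0.3, low, high, num_bins)
    print("training class for yaw 0.3 rad:", target)
    # Pretend the 3D head produced these logits for one region of interest.
    logits = [0.1, 0.0, 0.2, 0.4, 3.0, 0.3, 0.0, 0.1]
    print("decoded yaw:", predict_value(logits, low, high))
```

With this kind of discretization, the decoding error for an in-range value whose bin is classified correctly is at most half a bin width, so the bin count trades resolution against the number of classes the head has to separate.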