Learning Monocular Depth Estimation via Selective Distillation of Stereo Knowledge

Paper title

Learning Monocular Depth Estimation via Selective Distillation of Stereo Knowledge

Authors

Song, Kyeongseob; Yoon, Kuk-Jin

Abstract

Monocular depth estimation has been extensively explored based on deep learning, yet its accuracy and generalization ability still lag far behind those of stereo-based methods. To tackle this, a few recent studies have proposed to supervise the monocular depth estimation network by distilling disparity maps as proxy ground truths. However, these studies naively distill the stereo knowledge without considering the comparative advantages of stereo-based and monocular depth estimation methods. In this paper, we propose to selectively distill the disparity maps for more reliable proxy supervision. Specifically, we first design a decoder (MaskDecoder) that learns two binary masks, which are trained to choose optimally between the proxy disparity maps and the estimated depth maps for each pixel. The learned masks are then fed to another decoder (DepthDecoder) to force the estimated depths to learn only from the masked area in the proxy disparity maps. Additionally, a Teacher-Student module is designed to transfer the geometric knowledge of the StereoNet to the MonoNet. Extensive experiments validate that our methods achieve state-of-the-art performance for self- and proxy-supervised monocular depth estimation on the KITTI dataset, even surpassing some of the semi-supervised methods.
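
The idea of learning only from the masked area of a proxy disparity map can be illustrated with a small sketch. This is not the authors' implementation: the function name masked_proxy_loss, the use of a mean absolute error, and the toy arrays are all assumptions made for illustration, and the abstract does not specify the actual loss or how the masks are trained.

```python
import numpy as np

def masked_proxy_loss(pred, proxy, mask, eps=1e-8):
    """Mean absolute error between pred and proxy over pixels where mask == 1.

    Hypothetical helper: pred is the network's estimate, proxy is the
    stereo-derived proxy map, and mask is a binary map (1 = trust the proxy).
    """
    mask = mask.astype(pred.dtype)
    abs_err = np.abs(pred - proxy) * mask      # zero out untrusted pixels
    return abs_err.sum() / (mask.sum() + eps)  # average over trusted pixels only

# Toy 2x2 example; the values are made up.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
proxy = np.array([[1.5, 2.0], [0.0, 4.0]])
mask = np.array([[1, 1], [0, 1]])              # bottom-left proxy pixel is ignored
loss = masked_proxy_loss(pred, proxy, mask)
```

Here the large disagreement at the bottom-left pixel does not contribute to the loss because the mask excludes it, which is the intended effect of selective distillation in this simplified setting.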
