Paper Title
Displacement-Invariant Cost Computation for Efficient Stereo Matching
Paper Authors
Paper Abstract
Although deep learning-based methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy, their inference time is typically slow, on the order of seconds for a pair of 540p images. The main reason is that the leading methods employ time-consuming 3D convolutions applied to a 4D feature volume. A common way to speed up the computation is to downsample the feature volume, but this loses high-frequency details. To overcome these challenges, we propose a \emph{displacement-invariant cost computation module} to compute the matching costs without needing a 4D feature volume. Rather, costs are computed by applying the same 2D convolution network on each disparity-shifted feature map pair independently. Unlike previous 2D convolution-based methods that simply perform a context mapping between inputs and disparity maps, our proposed approach learns to match features between the two images. We also propose an entropy-based refinement strategy to refine the computed disparity map, which further improves speed by avoiding the need to compute a second disparity map on the right image. Extensive experiments on standard datasets (SceneFlow, KITTI, ETH3D, and Middlebury) demonstrate that our method achieves competitive accuracy with much less inference time. On typical image sizes, our method runs at over 100 FPS on a desktop GPU, making it suitable for time-critical applications such as autonomous driving. We also show that our approach generalizes well to unseen datasets, outperforming 4D-volumetric methods.
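
A minimal sketch of the cost computation described above, assuming PyTorch; the layer widths, module names, and the entropy-as-confidence step are illustrative assumptions, not the authors' implementation. It applies one shared 2D convolutional network to each disparity-shifted feature-map pair, stacks the per-disparity costs, and takes a soft-argmin, so no 4D feature volume or 3D convolution is needed:

```python
# A minimal sketch (not the authors' code) of displacement-invariant cost
# computation, assuming PyTorch. Layer widths and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CostNet2D(nn.Module):
    """Shared 2D conv network applied to every disparity-shifted feature pair."""

    def __init__(self, feat_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),  # one matching cost per pixel
        )

    def forward(self, left_feat, shifted_right_feat):
        return self.net(torch.cat([left_feat, shifted_right_feat], dim=1))


def displacement_invariant_costs(left_feat, right_feat, max_disp, cost_net):
    """Build a (B, D, H, W) cost volume without 4D features or 3D convolutions."""
    costs = []
    for d in range(max_disp):
        if d == 0:
            shifted = right_feat
        else:
            # Align right features with left pixels at disparity d:
            # shifted[:, :, :, x] = right_feat[:, :, :, x - d]
            shifted = F.pad(right_feat[:, :, :, :-d], (d, 0, 0, 0))
        costs.append(cost_net(left_feat, shifted))        # (B, 1, H, W)
    return torch.cat(costs, dim=1)                        # (B, D, H, W)


def soft_argmin_disparity(cost_volume):
    """Differentiable disparity estimate plus a per-pixel entropy confidence cue."""
    prob = F.softmax(-cost_volume, dim=1)                 # low cost -> high probability
    disp_values = torch.arange(
        cost_volume.size(1), dtype=prob.dtype, device=prob.device
    ).view(1, -1, 1, 1)
    disparity = (prob * disp_values).sum(dim=1)           # (B, H, W)
    entropy = -(prob * (prob + 1e-8).log()).sum(dim=1)    # high entropy = uncertain match
    return disparity, entropy
```

The entropy map here is only a plausible stand-in for the entropy-based refinement cue mentioned in the abstract; the actual refinement network is not specified there.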