A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection

Paper title

A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection

Authors

Xiaoqi Zhao, Lihe Zhang, Youwei Pang, Huchuan Lu, Lei Zhang

Abstract

Existing RGB-D salient object detection (SOD) approaches concentrate on the cross-modal fusion between the RGB stream and the depth stream. They do not deeply explore the effect of the depth map itself. In this work, we design a single stream network that directly uses the depth map to guide early fusion and middle fusion between RGB and depth, which saves the feature encoder of the depth stream and achieves a lightweight and real-time model. We tactfully utilize depth information from two perspectives: (1) To overcome the incompatibility problem caused by the great difference between modalities, we build a single stream encoder to achieve the early fusion, which can take full advantage of an ImageNet pre-trained backbone model to extract rich and discriminative features. (2) We design a novel depth-enhanced dual attention module (DEDA) to efficiently provide the fore-/back-ground branches with the spatially filtered features, which enables the decoder to optimally perform the middle fusion. Besides, we put forward a pyramidally attended feature extraction module (PAFE) to accurately localize objects of different scales. Extensive experiments demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics. Furthermore, this model is 55.5% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image.
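
The early fusion mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the single stream encoder combines the two modalities, so this example assumes the simplest reading: the depth map is appended to the RGB image as a fourth input channel before a single encoder sees it. The function name `early_fuse`, the channel-first nested-list layout, and the min-max depth normalization are all hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import List

Plane = List[List[float]]   # one channel: [row][col]
Image = List[Plane]         # channel-first image: [channel][row][col]


def early_fuse(rgb: Image, depth: Plane) -> Image:
    """Stack a 3-channel RGB image and a depth map into a 4-channel input.

    Assumption: early fusion is plain channel concatenation.
    """
    if len(rgb) != 3:
        raise ValueError("expected 3 RGB channels")
    height, width = len(depth), len(depth[0])
    for channel in rgb:
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("RGB and depth must share the same spatial size")

    # Assumption: min-max normalize depth to [0, 1] so its value range
    # is comparable to already-normalized RGB values.
    flat = [value for row in depth for value in row]
    lowest, highest = min(flat), max(flat)
    scale = (highest - lowest) or 1.0  # avoid division by zero on flat maps
    depth_norm = [[(value - lowest) / scale for value in row] for row in depth]

    # The fused input has channels R, G, B, D in that order.
    return [*rgb, depth_norm]


# Usage: a 2x2 RGB image and a 2x2 depth map (arbitrary depth units).
rgb = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(3)]
depth = [[500.0, 1000.0], [1500.0, 2500.0]]
fused = early_fuse(rgb, depth)
```

Under this assumption, a backbone pre-trained on 3-channel ImageNet images would need its first layer adapted to accept four channels; how the authors handle that step is not stated in the abstract.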
