Paper Title
Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection
Paper Authors
Paper Abstract
Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, exploring cross-modal feature fusion between sparse 3D points and dense 2D pixels is non-trivial. Recent approaches either fuse image features with point cloud features projected onto the 2D image plane, or combine the sparse point cloud with dense image pixels. These fusion strategies often suffer from severe information loss, causing sub-optimal performance. To address these problems, we construct a homogeneous structure between the point cloud and images, avoiding projective information loss by transforming camera features into the LiDAR 3D space. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into 3D space and generate homogeneous image voxel features. Then we fuse the voxelized point cloud features with image features from different regions through a self-attention-based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) to enforce consistency of the semantic information of identical objects across the homogeneous point cloud and image voxel representations, which provides object-level alignment guidance for cross-modal feature fusion and strengthens discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI dataset and the Waymo Open Dataset, where the proposed HMFI achieves better performance than state-of-the-art multi-modal methods. In particular, for 3D cyclist detection on the KITTI benchmark, HMFI surpasses all published algorithms by a large margin.
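To make the lifting step concrete, below is a minimal PyTorch sketch of how 2D image features can be lifted into a LiDAR voxel grid in the spirit of the IVLM: voxel centers are projected into the image with a calibration matrix and the feature map is bilinearly sampled at the projected locations. The function names (project_to_image, lift_image_features), the KITTI-style 3x4 projection matrix, and the bilinear-sampling strategy are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of lifting 2D image features into a LiDAR voxel grid
# (IVLM-style). All names and the sampling scheme are assumptions.
import torch
import torch.nn.functional as F

def project_to_image(voxel_centers, P):
    """Project (N, 3) LiDAR-frame voxel centers to pixel coordinates
    with a hypothetical KITTI-style 3x4 projection matrix P."""
    ones = torch.ones(voxel_centers.shape[0], 1)
    homo = torch.cat([voxel_centers, ones], dim=1)    # (N, 4) homogeneous
    uvw = homo @ P.T                                  # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)     # (N, 2) pixel coords
    return uv, uvw[:, 2]                              # pixels, depth

def lift_image_features(img_feat, voxel_centers, P, img_size):
    """Sample a (C, H, W) image feature map at each projected voxel
    center, returning homogeneous (N, C) image voxel features."""
    H, W = img_size
    uv, depth = project_to_image(voxel_centers, P)
    # Normalize pixels to [-1, 1] for grid_sample; voxels that fall
    # behind the camera or outside the image get zero features.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)                     # (1, 1, N, 2)
    sampled = F.grid_sample(img_feat.unsqueeze(0), grid,
                            align_corners=True)       # (1, C, 1, N)
    feats = sampled[0, :, 0].T                        # (N, C)
    valid = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return feats * valid.unsqueeze(1).float()

# Toy usage: a 64-channel image feature map and 1000 voxel centers.
img_feat = torch.randn(64, 96, 312)
centers = torch.rand(1000, 3) * torch.tensor([70.0, 40.0, 4.0])
P = torch.randn(3, 4)  # stand-in for a calibrated projection matrix
voxel_img_feats = lift_image_features(img_feat, centers, P, (96, 312))
print(voxel_img_feats.shape)  # torch.Size([1000, 64])
```

Because the lifted image features live on the same voxel grid as the point cloud features, a per-voxel fusion step such as the paper's self-attention-based QFM can then attend across the two modalities without any further projection.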