Paper Title
BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving
Paper Authors
Paper Abstract
In this paper, we present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. Unlike existing studies that focus on improving single-task approaches, BEVerse produces spatio-temporal Birds-Eye-View (BEV) representations from multi-camera videos and jointly reasons about multiple tasks for vision-centric autonomous driving. Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp, multi-view images. After ego-motion alignment, a spatio-temporal encoder is applied for further feature extraction in BEV. Finally, multiple task decoders are attached for joint reasoning and prediction. Within the decoders, we propose a grid sampler that generates BEV features with task-specific ranges and granularities. We also design an iterative-flow method for memory-efficient future prediction. We show that temporal information improves 3D object detection and semantic map construction, while multi-task learning implicitly benefits motion prediction. With extensive experiments on the nuScenes dataset, we show that the multi-task BEVerse outperforms existing single-task methods on 3D object detection, semantic map construction, and motion prediction. Compared with the sequential paradigm, BEVerse also achieves significantly improved efficiency. The code and trained models will be released at https://github.com/zhangyp15/BEVerse.
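To illustrate the grid-sampler idea from the abstract, the sketch below resamples a shared BEV feature map into task-specific ranges and resolutions via bilinear interpolation. This is a minimal sketch, not the official BEVerse implementation: the class name `GridSampler`, its parameters, and the assumption that the shared BEV features cover a symmetric square range around the ego vehicle are all hypothetical.

```python
# Minimal sketch of a BEV grid sampler (hypothetical API, not the official BEVerse code).
# Given shared BEV features covering a base range, it resamples them into a
# task-specific range and granularity with bilinear interpolation.
import torch
import torch.nn.functional as F


class GridSampler(torch.nn.Module):
    def __init__(self, base_range, target_range, target_size):
        """
        base_range:   half-extent (meters) covered by the shared BEV features,
                      assumed symmetric around the ego vehicle.
        target_range: half-extent (meters) the task decoder should see.
        target_size:  (H, W) resolution of the resampled task-specific grid.
        """
        super().__init__()
        h, w = target_size
        # Cell-center coordinates of the target grid, in meters.
        ys = torch.linspace(-target_range, target_range, h)
        xs = torch.linspace(-target_range, target_range, w)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        # Normalize to [-1, 1] relative to the base range, as grid_sample expects.
        grid = torch.stack([grid_x, grid_y], dim=-1) / base_range
        self.register_buffer("grid", grid.unsqueeze(0))  # (1, H, W, 2)

    def forward(self, bev_feats):
        # bev_feats: (B, C, H_base, W_base) shared BEV features.
        grid = self.grid.expand(bev_feats.shape[0], -1, -1, -1)
        return F.grid_sample(bev_feats, grid, mode="bilinear", align_corners=False)


# Example (made-up numbers): detection uses a 51.2 m half-range at 128x128,
# while mapping uses a finer 30 m half-range at 200x200, both cut from the
# same shared BEV features.
shared_bev = torch.randn(2, 64, 128, 128)
det_feats = GridSampler(51.2, 51.2, (128, 128))(shared_bev)
map_feats = GridSampler(51.2, 30.0, (200, 200))(shared_bev)
print(det_feats.shape, map_feats.shape)  # (2, 64, 128, 128) (2, 64, 200, 200)
```

The design point this sketch captures is that the expensive backbone, lifting, and temporal encoding run once, while each task decoder cheaply carves out the spatial extent and resolution it needs from the shared representation.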