Paper Title
Real-time Semantic Segmentation with Fast Attention
Paper Authors
Paper Abstract
In deep CNN-based models for semantic segmentation, high accuracy relies on rich spatial context (large receptive fields) and fine spatial details (high resolution), both of which incur high computational costs. In this paper, we propose a novel architecture that addresses both challenges and achieves state-of-the-art performance for semantic segmentation of high-resolution images and videos in real time. The proposed architecture relies on our fast spatial attention, a simple yet efficient modification of the popular self-attention mechanism that, by changing the order of operations, captures the same rich spatial context at a small fraction of the computational cost. Moreover, to efficiently process high-resolution input, we apply additional spatial reduction to intermediate feature stages of the network with minimal loss in accuracy, thanks to the use of the fast attention module to fuse features. We validate our method with a series of experiments on multiple datasets, demonstrating superior accuracy and speed compared to existing approaches for real-time semantic segmentation. On Cityscapes, our network achieves 74.4$\%$ mIoU at 72 FPS and 75.5$\%$ mIoU at 58 FPS on a single Titan X GPU, which is $\sim$50$\%$ faster than the state-of-the-art while retaining the same accuracy.
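The reordering trick described in the abstract can be sketched in a few lines. The following PyTorch snippet is a minimal illustration, not the paper's exact implementation: it assumes the softmax in standard self-attention is replaced by L2 normalization of the query and key (as the abstract's "simple modification" suggests), which makes the matrix products associative, so the quadratic-in-resolution $(QK^\top)V$ can be computed as the much cheaper $Q(K^\top V)$. The tensor layout, the `fast_attention` name, and the $1/n$ scaling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fast_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Sketch of attention with reordered matrix products.

    q, k, v: (B, N, C) tensors, where N = H*W spatial positions.
    Standard self-attention computes softmax(q @ k^T) @ v at O(N^2 * C) cost.
    L2-normalizing q and k (an assumed stand-in for the softmax) makes the
    product associative, so q @ (k^T @ v) costs only O(N * C^2), a large
    saving when N >> C, as with high-resolution feature maps.
    """
    n = q.shape[1]
    q = F.normalize(q, dim=-1)          # L2-normalize along the channel axis
    k = F.normalize(k, dim=-1)
    context = k.transpose(-2, -1) @ v   # (B, C, C): aggregate values first
    return (q @ context) / n            # (B, N, C): assumed 1/n scaling
```

For example, with a 64x128 feature map of 32 channels, `fast_attention(x, x, x)` on an `x` of shape `(1, 8192, 32)` works with intermediate matrices of size 32x32 rather than 8192x8192, which is the source of the speedup the abstract claims.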