Paper Title
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation
Paper Authors
Paper Abstract
Semantic segmentation usually benefits from global context, fine localisation information, multi-scale features, etc. To advance Transformer-based segmenters in these aspects, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer makes two key contributions. First, it introduces a novel pyramid-structured Transformer encoder which harvests global context and fine localisation features simultaneously. These features are concatenated and fed into a convolution layer for final per-pixel prediction. Second, IncepFormer integrates an Inception-like architecture with depth-wise convolutions, together with a light-weight feed-forward module, into each self-attention layer, efficiently obtaining rich local multi-scale object features. Extensive experiments on five benchmarks show that our IncepFormer is superior to state-of-the-art methods in both accuracy and speed, e.g., 1) our IncepFormer-S achieves 47.7% mIoU on ADE20K, outperforming the existing best method by 1% while using only half the parameters and fewer FLOPs; 2) our IncepFormer-B achieves 82.0% mIoU on the Cityscapes dataset with 39.6M parameters. Code is available at github.com/shendu0321/IncepFormer.
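Below is a minimal PyTorch sketch of the ideas the abstract describes: an Inception-like token mixer built from parallel depth-wise convolutions, a light-weight feed-forward module, and a decoder that concatenates pyramid features before a single convolution for per-pixel prediction. This is not the authors' released code; the module names (InceptionMixer, LightFFN, SimpleDecoder), the kernel sizes, and the expansion ratio are illustrative assumptions based only on the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InceptionMixer(nn.Module):
    # Inception-like mixer: parallel depth-wise convolutions with different
    # kernel sizes capture local multi-scale context (kernel sizes assumed).
    def __init__(self, dim, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernels]
        )
        self.fuse = nn.Conv2d(dim, dim, 1)  # 1x1 conv to fuse the branch outputs

    def forward(self, x):                   # x: (B, C, H, W)
        return self.fuse(sum(branch(x) for branch in self.branches))


class LightFFN(nn.Module):
    # Light-weight feed-forward module: 1x1 expand -> depth-wise 3x3 -> 1x1 project
    # (structure and expansion ratio are assumptions, not the paper's exact design).
    def __init__(self, dim, ratio=2):
        super().__init__()
        hidden = dim * ratio
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return self.net(x)


class SimpleDecoder(nn.Module):
    # Upsamples the pyramid features to a common resolution, concatenates them,
    # and predicts per-pixel classes with a single convolution, as in the abstract.
    def __init__(self, dims, num_classes):
        super().__init__()
        self.classify = nn.Conv2d(sum(dims), num_classes, 1)

    def forward(self, feats):               # feats: list of (B, C_i, H_i, W_i)
        target = feats[0].shape[-2:]        # use the highest-resolution stage
        ups = [
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats
        ]
        return self.classify(torch.cat(ups, dim=1))
```

In a full model, InceptionMixer and LightFFN would sit inside each self-attention block of the pyramid encoder, and SimpleDecoder would consume the feature maps produced by its stages; for example, `SimpleDecoder([64, 128, 320, 512], num_classes=150)` applied to four pyramid outputs yields an ADE20K-style per-pixel prediction map.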