Paper Title
IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation
Paper Authors
Paper Abstract
Semantic segmentation usually benefits from global context, fine localisation information, multi-scale features, etc. To advance Transformer-based segmenters in these aspects, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer makes two key contributions. First, it introduces a novel pyramid-structured Transformer encoder which harvests global context and fine localisation features simultaneously. These features are concatenated and fed into a convolution layer for final per-pixel prediction. Second, IncepFormer integrates an Inception-like architecture with depth-wise convolutions, together with a light-weight feed-forward module, into each self-attention layer, efficiently obtaining rich local multi-scale object features. Extensive experiments on five benchmarks show that our IncepFormer is superior to state-of-the-art methods in both accuracy and speed, e.g., 1) our IncepFormer-S achieves 47.7% mIoU on ADE20K, outperforming the existing best method by 1% while using only half the parameters and fewer FLOPs; 2) our IncepFormer-B achieves 82.0% mIoU on the Cityscapes dataset with 39.6M parameters. Code is available at github.com/shendu0321/IncepFormer.
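Below is a minimal PyTorch sketch of the ideas the abstract describes: an Inception-like token mixer built from parallel depth-wise convolutions, a light-weight feed-forward module, and a decoder that concatenates pyramid features before a single convolution for per-pixel prediction. This is not the authors' released code; the module names (InceptionMixer, LightFFN, SimpleDecoder), the kernel sizes, and the expansion ratio are illustrative assumptions based only on the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InceptionMixer(nn.Module):
    # Inception-like mixer: parallel depth-wise convolutions with different
    # kernel sizes capture local multi-scale context (kernel sizes assumed).
    def __init__(self, dim, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernels]
        )
        self.fuse = nn.Conv2d(dim, dim, 1)  # 1x1 conv to fuse the branch outputs

    def forward(self, x):                   # x: (B, C, H, W)
        return self.fuse(sum(branch(x) for branch in self.branches))


class LightFFN(nn.Module):
    # Light-weight feed-forward module: 1x1 expand -> depth-wise 3x3 -> 1x1 project
    # (structure and expansion ratio are assumptions, not the paper's exact design).
    def __init__(self, dim, ratio=2):
        super().__init__()
        hidden = dim * ratio
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return self.net(x)


class SimpleDecoder(nn.Module):
    # Upsamples the pyramid features to a common resolution, concatenates them,
    # and predicts per-pixel classes with a single convolution, as in the abstract.
    def __init__(self, dims, num_classes):
        super().__init__()
        self.classify = nn.Conv2d(sum(dims), num_classes, 1)

    def forward(self, feats):               # feats: list of (B, C_i, H_i, W_i)
        target = feats[0].shape[-2:]        # use the highest-resolution stage
        ups = [
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats
        ]
        return self.classify(torch.cat(ups, dim=1))
```

In a full model, InceptionMixer and LightFFN would sit inside each self-attention block of the pyramid encoder, and SimpleDecoder would consume the feature maps produced by its stages; for example, `SimpleDecoder([64, 128, 320, 512], num_classes=150)` applied to four pyramid outputs yields an ADE20K-style per-pixel prediction map.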