Representation Separation for Semantic Segmentation with Vision Transformers

Paper title

Representation Separation for Semantic Segmentation with Vision Transformers

Authors

Yuanduo Hong, Huihui Pan, Weichao Sun, Xinghu Yu, Huijun Gao

Abstract

Vision transformers (ViTs), which encode an image as a sequence of patches, bring new paradigms to semantic segmentation. We present an efficient framework of representation separation at the local-patch level and the global-region level for semantic segmentation with ViTs. It targets the peculiar over-smoothness of ViTs in semantic segmentation, and therefore differs from the currently popular paradigm of context modeling and from most existing related methods, which reinforce the advantage of attention. We first deliver the decoupled two-pathway network, in which an additional pathway enhances and passes down local-patch discrepancy complementary to the global representations of transformers. We then propose the spatially adaptive separation module, which obtains more separate deep representations, and the discriminative cross-attention, which yields more discriminative region representations through novel auxiliary supervisions. The proposed methods achieve some impressive results: 1) incorporated with large-scale plain ViTs, our methods achieve new state-of-the-art performance on five widely used benchmarks; 2) using masked pre-trained plain ViTs, we achieve 68.9% mIoU on Pascal Context, setting a new record; 3) pyramid ViTs integrated with the decoupled two-pathway network even surpass the well-designed high-resolution ViTs on Cityscapes; 4) the improved representations produced by our framework transfer favorably to images with natural corruptions. The code will be released publicly.
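
The abstract reports its results in mIoU (mean intersection-over-union), the standard metric for semantic segmentation. As a reading aid, here is a minimal sketch of how mIoU can be computed over flattened label maps. It is not the authors' evaluation code: the function name `mean_iou`, the `ignore_index` default of 255, and the toy labels are all assumptions made for illustration. Benchmark protocols typically accumulate the counts over the whole dataset before averaging, which this single-call sketch does not show.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = 255) -> float:
    """Mean IoU over the classes that appear in pred or target.

    `ignore_index` marks unlabeled pixels (255 is a common convention,
    assumed here rather than taken from the paper).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes   # pixels where pred == target == c
    pred_count = [0] * num_classes     # pixels predicted as class c
    target_count = [0] * num_classes   # pixels labeled as class c

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled pixels entirely
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # classes absent from both maps are not averaged
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.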
