Paper Title


TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation

Paper Authors

Ruiwen Li, Zheda Mai, Chiheb Trabelsi, Zhibo Zhang, Jongseong Jang, Scott Sanner

Paper Abstract


Weakly supervised semantic segmentation (WSSS) with only image-level supervision is a challenging task. Most existing methods exploit Class Activation Maps (CAM) to generate pixel-level pseudo labels for supervised training. However, due to the local receptive field of Convolutional Neural Networks (CNN), CAM applied to CNNs often suffers from partial activation -- highlighting the most discriminative part instead of the entire object area. In order to capture both local features and global representations, the Conformer has been proposed to combine a visual transformer branch with a CNN branch. In this paper, we propose TransCAM, a Conformer-based solution to WSSS that explicitly leverages the attention weights from the transformer branch of the Conformer to refine the CAM generated from the CNN branch. TransCAM is motivated by our observation that attention weights from shallow transformer blocks are able to capture low-level spatial feature similarities while attention weights from deep transformer blocks capture high-level semantic context. Despite its simplicity, TransCAM achieves a new state-of-the-art performance of 69.3% and 69.6% on the respective PASCAL VOC 2012 validation and test sets, showing the effectiveness of transformer attention-based refinement of CAM for WSSS.
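The core refinement step described in the abstract -- propagating CAM scores through transformer attention -- can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the shapes, the head-averaged attention input, and the simple across-block mean fusion are all assumptions for the sketch.

```python
import numpy as np

# Hypothetical shapes: a single-class CAM from the CNN branch over an
# h x w feature grid, and attention weights from the transformer branch,
# one map per block, already averaged over heads: (num_blocks, hw, hw).
h, w, num_blocks = 14, 14, 12
rng = np.random.default_rng(0)
cam = rng.random((h, w))                       # stand-in CAM scores
attn = rng.random((num_blocks, h * w, h * w))  # stand-in attention weights
attn /= attn.sum(axis=-1, keepdims=True)       # row-normalize, as softmax would

# Sketch of attention-based refinement: fuse attention across all blocks
# (shallow blocks carry low-level spatial similarity, deep blocks carry
# high-level semantic context), then propagate CAM scores through the
# fused pairwise affinities so activation spreads to similar positions.
fused_attn = attn.mean(axis=0)                 # (hw, hw)
refined = (fused_attn @ cam.reshape(-1)).reshape(h, w)
print(refined.shape)  # (14, 14)
```

Because each attention row is normalized, the refinement is a weighted average of CAM scores over positions the query attends to, which is what lets activation expand from the most discriminative part toward the whole object region.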
