Paper Title
MixFormer: Mixing Features across Windows and Dimensions
Paper Authors
Paper Abstract
While local-window self-attention performs notably in vision tasks, it suffers from limited receptive field and weak modeling capability issues. This is mainly because it performs self-attention within non-overlapped windows and shares weights on the channel dimension. We propose MixFormer to address these issues. First, we combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive fields. Second, we propose bi-directional interactions across branches to provide complementary clues in the channel and spatial dimensions. These two designs are integrated to achieve efficient feature mixing among windows and dimensions. Our MixFormer achieves results on image classification competitive with EfficientNet and better than RegNet and Swin Transformer. In downstream tasks, MixFormer outperforms its alternatives by significant margins with less computational cost on five dense prediction tasks on MS COCO, ADE20K, and LVIS. Code is available at \url{https://github.com/PaddlePaddle/PaddleClas}.
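To make the two designs concrete, below is a minimal PyTorch sketch of such a mixing block: a local-window self-attention branch and a depth-wise convolution branch run in parallel, with a channel gate computed from the conv branch modulating the attention output and a spatial gate computed from the attention branch modulating the conv output. The module and layer names (WindowSelfAttention, MixingBlock, channel_gate, spatial_gate) and the exact gating forms are illustrative assumptions based only on the abstract, not the authors' reference implementation.

import torch
import torch.nn as nn


class WindowSelfAttention(nn.Module):
    """Multi-head self-attention inside non-overlapping windows."""

    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, C, H, W); H, W divisible by window_size
        B, C, H, W = x.shape
        w = self.window_size
        # Partition into (B * num_windows, w*w, C) token sequences.
        x = x.view(B, C, H // w, w, W // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        qkv = self.qkv(x).reshape(x.shape[0], w * w, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, C, H, W).
        out = out.view(B, H // w, W // w, w, w, C)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


class MixingBlock(nn.Module):
    """Parallel window attention + depth-wise conv with bi-directional
    interactions: channel cues flow conv -> attention, spatial cues
    flow attention -> conv (an assumed, illustrative form of the gates)."""

    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.attn = WindowSelfAttention(dim, window_size, num_heads)
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise
        # Channel interaction: squeeze-and-excitation-style gate from the conv branch.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )
        # Spatial interaction: per-pixel gate from the attention branch.
        self.spatial_gate = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(dim * 2, dim, 1)

    def forward(self, x):
        a = self.attn(x)               # window branch: shares weights over channels
        c = self.conv(x)               # conv branch: cross-window, per-channel weights
        a = a * self.channel_gate(c)   # conv branch supplies channel cues
        c = c * self.spatial_gate(a)   # attention branch supplies spatial cues
        return self.fuse(torch.cat([a, c], dim=1))


if __name__ == "__main__":
    block = MixingBlock(dim=64)
    print(block(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])

The pairing is complementary by construction: window attention is confined to non-overlapping windows and shares weights across channels, while the depth-wise convolution spans window borders and learns per-channel weights, so the bi-directional gates let each branch compensate for what the other lacks.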