Paper Title
BViT: Broad Attention based Vision Transformer
Paper Authors
Paper Abstract
Recent works have demonstrated that transformers can achieve promising performance in computer vision by exploiting the relationships among image patches with self-attention. However, they only consider attention in a single feature layer and ignore the complementarity of attention at different levels. In this paper, we propose broad attention, which improves performance by incorporating the attention relationships of different layers in vision transformers; the resulting model is called BViT. Broad attention is implemented by broad connection and parameter-free attention. The broad connection of each transformer layer promotes the transmission and integration of information for BViT. Without introducing additional trainable parameters, parameter-free attention jointly attends to the attention information already available in different layers to extract useful features and build their relationships. Experiments on image classification tasks demonstrate that BViT delivers state-of-the-art top-1 accuracy of 74.8\%/81.6\% on ImageNet with 5M/22M parameters. Moreover, we transfer BViT to downstream object recognition benchmarks, achieving 98.9\% and 89.9\% on CIFAR10 and CIFAR100, respectively, exceeding ViT while using fewer parameters. For the generalization test, broad attention in Swin Transformer and T2T-ViT also brings an improvement of more than 1\%. To sum up, broad attention is promising for improving the performance of attention-based models. Code and pre-trained models are available at https://github.com/DRL-CASIA/Broad_ViT.
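To make the described mechanism more concrete, below is a minimal PyTorch-style sketch of how attention outputs collected from every transformer layer might be fused without introducing new trainable parameters. The function name `broad_attention`, the tensor shapes, and the choice of reusing the stacked features themselves as query, key, and value are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch

def broad_attention(per_layer_attn_outputs):
    """Hypothetical sketch of parameter-free broad attention.

    per_layer_attn_outputs: list of tensors, each of shape
    (batch, num_tokens, dim), i.e. the self-attention output of every
    transformer layer gathered via broad connections.
    """
    # Stack the per-layer outputs: (batch, layers, tokens, dim).
    stacked = torch.stack(per_layer_attn_outputs, dim=1)
    B, L, N, D = stacked.shape

    # Parameter-free attention: use the collected features themselves as
    # query, key, and value instead of adding new projection weights.
    tokens = stacked.reshape(B, L * N, D)
    scores = torch.softmax(tokens @ tokens.transpose(-2, -1) / D ** 0.5, dim=-1)
    fused = scores @ tokens                        # (B, L*N, D)

    # Average the fused tokens over layers to obtain one broad feature.
    return fused.reshape(B, L, N, D).mean(dim=1)   # (B, N, D)
```

Because no new projection matrices are introduced, a fusion of this kind leaves the backbone's parameter count unchanged, consistent with the paper's description of the attention as parameter-free.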