Paper Title


EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

Authors

Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, Brais Martinez

Abstract


Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. Despite increasingly stronger variants with ever-higher recognition accuracies, due to the quadratic complexity of self-attention, existing ViTs are typically demanding in computation and model size. Although several successful design choices (e.g., the convolutions and hierarchical multi-stage structure) of prior CNNs have been reintroduced into recent ViTs, they are still not sufficient to meet the limited resource requirements of mobile devices. This motivates a very recent attempt to develop light ViTs based on the state-of-the-art MobileNet-v2, but still leaves a performance gap behind. In this work, pushing further along this under-studied direction we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency. This is realized by introducing a highly cost-effective local-global-local (LGL) information exchange bottleneck based on optimal integration of self-attention and convolutions. For device-dedicated evaluation, rather than relying on inaccurate proxies like the number of FLOPs or parameters, we adopt a practical approach of focusing directly on on-device latency and, for the first time, energy efficiency. Specifically, we show that our models are Pareto-optimal when both accuracy-latency and accuracy-energy trade-offs are considered, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs. Code is available at https://github.com/saic-fi/edgevit.
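The local-global-local (LGL) bottleneck can be pictured as three steps: aggregate each spatial neighbourhood into a delegate token, run full self-attention only among the (much fewer) delegates, then propagate the updated delegates back to their neighbourhoods. The toy NumPy sketch below illustrates only this cost structure; it uses window averaging, identity attention projections, and nearest-neighbour upsampling, whereas the actual EdgeViT blocks use learned (e.g. convolutional) operators for each step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lgl_bottleneck(x, r=2):
    """Toy sketch of local-global-local information exchange on an (H, W, C) feature map.

    1) Local aggregation: average each non-overlapping r x r window into one delegate.
    2) Global sparse attention: self-attention among delegates only, so the cost
       scales with (H*W / r^2)^2 instead of (H*W)^2.
    3) Local propagation: broadcast each updated delegate back to its r x r window
       and add it residually.
    """
    H, W, C = x.shape
    assert H % r == 0 and W % r == 0, "toy version assumes r divides H and W"
    # 1) local aggregation via window averaging (the paper uses learned local ops)
    d = x.reshape(H // r, r, W // r, r, C).mean(axis=(1, 3))   # (H/r, W/r, C)
    # 2) global self-attention among delegate tokens (identity Q/K/V for brevity)
    t = d.reshape(-1, C)                                       # (N, C) delegates
    attn = softmax(t @ t.T / np.sqrt(C))                       # (N, N) attention
    t = attn @ t
    # 3) local propagation: nearest-neighbour upsample back to (H, W, C)
    up = t.reshape(H // r, W // r, C).repeat(r, axis=0).repeat(r, axis=1)
    return x + up
```

With subsampling rate r, the attention matrix shrinks by a factor of r^4, which is what lets an attention-based block fit a mobile compute budget while still mixing information globally.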
