Title
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Authors
Abstract
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
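The core operator the abstract describes, modulated deformable convolution (the DCNv2-style op that gives InternImage its adaptive spatial aggregation: each kernel tap samples at a learned fractional offset and is scaled by a learned modulation mask), can be sketched naively in NumPy. This is an illustrative single-channel simplification under assumed tensor layouts, not the paper's implementation; the function names are hypothetical.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly sample a 2D map x at fractional location (py, px).
    Out-of-bounds taps contribute zero (implicit zero padding)."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yi, wy in ((y0, 1.0 - (py - y0)), (y0 + 1, py - y0)):
        for xi, wx in ((x0, 1.0 - (px - x0)), (x0 + 1, px - x0)):
            if 0 <= yi < H and 0 <= xi < W:
                val += wy * wx * x[yi, xi]
    return val

def deform_conv2d_naive(x, weight, offsets, mask):
    """Naive single-channel modulated deformable convolution.

    x:       (H, W) input feature map
    weight:  (kh, kw) kernel
    offsets: (H, W, kh*kw, 2) learned (dy, dx) per output location and tap
    mask:    (H, W, kh*kw) learned modulation scalar per location and tap
    Returns a same-size (H, W) output; the kernel is centered, and each
    tap's sampling point is shifted by its learned offset.
    """
    H, W = x.shape
    kh, kw = weight.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            k = 0
            for u in range(kh):
                for v in range(kw):
                    dy, dx = offsets[i, j, k]
                    py = i + u - kh // 2 + dy  # fractional sample row
                    px = j + v - kw // 2 + dx  # fractional sample col
                    out[i, j] += (weight[u, v] * mask[i, j, k]
                                  * bilinear_sample(x, py, px))
                    k += 1
    return out
```

With all offsets zero and a mask of ones this reduces to an ordinary zero-padded convolution; because the offsets (and mask) are themselves predicted from the input, the effective receptive field adapts per location, which is the property the abstract contrasts with the fixed grid of traditional CNN kernels.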