Paper Title
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Paper Authors
Paper Abstract
To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of pre-training. It is thus desirable that these strategies be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all existing approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, we successfully pre-train a billion-parameter image backbone and achieve state-of-the-art performance on various benchmarks. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.
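The abstract's unified objective is a mutual-information maximization between representations of different modalities/views. The paper's exact formulation is not reproduced here, but a common tractable lower bound on mutual information is the InfoNCE contrastive loss. The sketch below is a minimal, hypothetical illustration of that family of objectives (the function name, NumPy implementation, and temperature value are all assumptions, not the paper's method):

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.07):
    """InfoNCE-style lower bound on mutual information between two
    batches of paired representations (row i of z_a pairs with row i
    of z_b; all other rows serve as negatives).

    Illustrative sketch only -- not the M3I objective itself.
    """
    # L2-normalize each embedding so logits are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix

    # Cross-entropy with positives on the diagonal (numerically stable)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: 8 paired samples with 16-dim embeddings.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16))
loss = info_nce_loss(a, a)  # perfectly aligned pairs -> loss near 0
```

Minimizing this loss pushes paired (positive) representations together relative to in-batch negatives, which maximizes a lower bound on the mutual information between the two modalities; different choices of the paired "views" (image/label, image/caption, image/augmented image) recover the supervised, weakly-supervised, and self-supervised settings mentioned in the abstract.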