Paper Title
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Paper Authors
Paper Abstract
Most existing vision and language pre-training (VLP) methods have mainly focused on how to extract and align vision and text features. In contrast to the mainstream VLP methods, we highlight that two routinely applied steps during pre-training have a crucial impact on the performance of the pre-trained model: in-batch hard negative sampling for image-text matching (ITM) and assigning a large masking probability for masked language modeling (MLM). After empirically showing the unexpected effectiveness of the above two steps, we systematically devise GRIT-VLP, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM while maintaining the computational cost of pre-training. Our method consists of three components: 1) a GRouped mIni-baTch sampling (GRIT) strategy that collects similar examples in a mini-batch, 2) an ITC consistency loss for improving the mining ability, and 3) an enlarged masking probability for MLM. Consequently, we show that GRIT-VLP achieves new state-of-the-art performance on various downstream tasks with much less computational cost. Furthermore, we demonstrate that our model is essentially on par with ALBEF, the previous state of the art, using only one-third of the training epochs on the same training data. Code is available at https://github.com/jaeseokbyun/GRIT-VLP.
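The sketch below is not the authors' implementation, only a minimal illustration of the two ideas the abstract highlights: grouping examples with high image-text contrastive (ITC) similarity into the same mini-batch, and mining in-batch hard negatives for the ITM head. The greedy grouping heuristic, tensor shapes, and function names are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the GRIT-VLP code) of ITC-similarity-based
# mini-batch grouping and in-batch hard negative sampling for ITM.
import torch
import torch.nn.functional as F

def group_minibatches(img_emb, txt_emb, batch_size):
    """Greedily place examples whose ITC similarity is high into the same
    mini-batch so that in-batch negatives are harder (assumed heuristic)."""
    sim = img_emb @ txt_emb.t()                      # (N, N) ITC similarity
    remaining = list(range(sim.size(0)))
    batches = []
    while remaining:
        anchor = remaining.pop(0)
        # rank remaining examples by similarity of their text to the anchor image
        order = sorted(remaining, key=lambda j: -sim[anchor, j].item())
        group = [anchor] + order[: batch_size - 1]
        remaining = [j for j in remaining if j not in group]
        batches.append(group)
    return batches

def sample_hard_negatives(img_emb, txt_emb):
    """For each image in a mini-batch, pick the non-matching text with the
    highest ITC similarity as its hard negative for the ITM loss."""
    sim = img_emb @ txt_emb.t()
    sim.fill_diagonal_(float("-inf"))                # exclude the positive pair
    return sim.argmax(dim=1)                         # index of hard negative text

if __name__ == "__main__":
    torch.manual_seed(0)
    img = F.normalize(torch.randn(8, 16), dim=-1)    # toy image embeddings
    txt = F.normalize(torch.randn(8, 16), dim=-1)    # toy text embeddings
    print(group_minibatches(img, txt, batch_size=4))
    print(sample_hard_negatives(img, txt))
```

The intuition is that, when mini-batches are pre-grouped by ITC similarity, the hard negative selected inside each batch is more likely to be a genuinely confusing pair, which the abstract argues improves ITM training without extra pre-training cost.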