Paper Title
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
Paper Authors
Paper Abstract
Previous vision-language pre-training models mainly construct multi-modal inputs from tokens and objects (pixels) and then perform cross-modality interaction between them. We argue that an input of only tokens and object features limits high-level semantic alignment such as phrase-to-region grounding. Meanwhile, multi-level alignments are inherently consistent and able to facilitate representation learning synergistically. Therefore, in this paper, we propose to learn Multi-level semantic alignment for Vision-language Pre-TRaining (MVPTR). In MVPTR, we follow the nested structure of both modalities to introduce concepts as high-level semantics. To ease learning from multi-modal, multi-level inputs, our framework is split into two stages: the first stage focuses on intra-modality multi-level representation learning, while the second enforces interactions across modalities via both coarse-grained and fine-grained semantic alignment tasks. In addition to the commonly used image-text matching and masked language modeling tasks, we introduce a masked concept recovering task in the first stage to enhance concept representation learning, and two more tasks in the second stage to explicitly encourage multi-level alignment across modalities. Our code is available at https://github.com/Junction4Nako/mvp_pytorch.
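To make the two-stage design described above concrete, the following is a minimal PyTorch sketch of how multi-level inputs (word tokens, object regions, and a shared concept vocabulary) could flow through an intra-modality stage and a cross-modality stage, with heads for the three pre-training tasks the abstract names. This is an illustrative assumption, not the authors' implementation (see the linked repository); all module choices, dimensions, layer counts, and head placements are placeholders.

```python
import torch
import torch.nn as nn

class MVPTRSketch(nn.Module):
    """Illustrative two-stage layout; hyper-parameters are placeholders."""
    def __init__(self, vocab_size=30522, num_concepts=2000, dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)      # word tokens
        self.concept_emb = nn.Embedding(num_concepts, dim)  # shared high-level concepts
        self.region_proj = nn.Linear(2048, dim)             # detected region features
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # Stage 1: separate intra-modality encoders over multi-level inputs.
        self.text_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.vis_enc = nn.TransformerEncoder(layer, num_layers=2)
        # Stage 2: a cross-modal encoder over the concatenated sequences.
        self.cross_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.mcr_head = nn.Linear(dim, num_concepts)  # masked concept recovering (stage 1)
        self.mlm_head = nn.Linear(dim, vocab_size)    # masked language modeling
        self.itm_head = nn.Linear(dim, 2)             # image-text matching

    def forward(self, tokens, text_concepts, regions, vis_concepts):
        n_tok = tokens.size(1)
        # Stage 1: intra-modality multi-level representation learning.
        txt = self.text_enc(torch.cat(
            [self.token_emb(tokens), self.concept_emb(text_concepts)], dim=1))
        vis = self.vis_enc(torch.cat(
            [self.region_proj(regions), self.concept_emb(vis_concepts)], dim=1))
        mcr_logits = self.mcr_head(txt[:, n_tok:])  # recover masked concepts
        # Stage 2: cross-modality interaction for multi-level alignment.
        fused = self.cross_enc(torch.cat([txt, vis], dim=1))
        mlm_logits = self.mlm_head(fused[:, :n_tok])
        itm_logits = self.itm_head(fused[:, 0])  # first token as pooled representation
        return mcr_logits, mlm_logits, itm_logits

# Usage with dummy inputs: 16 tokens, 4 concepts per modality, 10 regions.
model = MVPTRSketch()
mcr, mlm, itm = model(torch.randint(0, 30522, (2, 16)),
                      torch.randint(0, 2000, (2, 4)),
                      torch.randn(2, 10, 2048),
                      torch.randint(0, 2000, (2, 4)))
```

Note that the sketch omits the paper's two stage-2 alignment objectives, whose exact formulation is not specified in the abstract; only the heads for the explicitly named tasks (MLM, ITM, and masked concept recovering) are shown.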