Paper Title
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Paper Authors
Paper Abstract
Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, and VLP models that do not rely on object detectors are becoming mainstream due to their superior computational efficiency and competitive performance. However, removing the object detector also deprives VLP models of the capability for explicit object modeling, which is essential to various position-sensitive vision-language (VL) tasks, such as referring expression comprehension and visual commonsense reasoning. To address this challenge, we introduce PEVL, which enhances the pre-training and prompt tuning of VLP models with explicit object position modeling. Specifically, PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training and also enables flexible prompt tuning for various downstream tasks. We show that PEVL enables detector-free VLP models to achieve state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves performance on position-insensitive tasks with grounded inputs. We make the data and code for this paper publicly available at https://github.com/thunlp/PEVL.
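To make the "unified language modeling" idea in the abstract concrete, the sketch below shows one way bounding-box coordinates could be discretized into position tokens and spliced into a caption, so that positions and words share a single token sequence for masked language modeling. This is an illustrative assumption, not the official PEVL implementation (see the repository above for that); the bin count, image size, and helper names are hypothetical choices.

```python
# Illustrative sketch (not the official PEVL code): discretize bounding-box
# coordinates into position tokens and splice them into the caption so that
# positions and language share one token vocabulary.
# NUM_BINS, IMG_SIZE, pos_token, and add_position_tokens are hypothetical names.

NUM_BINS = 512          # assumed number of discrete position bins
IMG_SIZE = 512          # assumed reference image resolution

def pos_token(coord: float) -> str:
    """Map a pixel coordinate to a discrete position token such as '<pos_123>'."""
    bin_id = min(int(coord / IMG_SIZE * NUM_BINS), NUM_BINS - 1)
    return f"<pos_{bin_id}>"

def add_position_tokens(caption: str, entity: str, box: tuple) -> str:
    """Insert discretized box coordinates (x1, y1, x2, y2) after an entity mention."""
    x1, y1, x2, y2 = box
    pos_str = " ".join(pos_token(c) for c in (x1, y1, x2, y2))
    return caption.replace(entity, f"{entity} {pos_str}", 1)

# Example: the grounded caption becomes a single token sequence, so a masked
# language model can be asked to recover either word tokens or position tokens.
print(add_position_tokens("A dog is lying on the grass.", "dog",
                          (34.0, 120.5, 256.0, 400.0)))
# -> "A dog <pos_34> <pos_120> <pos_256> <pos_400> is lying on the grass."
```

Because the position tokens live in the same vocabulary as ordinary words, masking them yields a grounding-style objective (predict where the entity is), while masking words with positions visible yields a grounded captioning-style objective; this is the sense in which a single language modeling framework can serve both pre-training alignment and downstream prompt tuning.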