Pre-training image-language transformers for open-vocabulary tasks

Paper title

Pre-training image-language transformers for open-vocabulary tasks

Authors

AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

Abstract

We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
