Paper Title
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Paper Authors
Paper Abstract
Pre-trained language models have recently contributed to significant advances in NLP tasks. Multi-modal versions of BERT have since been developed, relying on heavy pre-training over vast corpora of aligned textual and image data, and applied primarily to classification tasks such as VQA. In this paper, we evaluate the visual capabilities of BERT out-of-the-box, avoiding any pre-training on supplementary data. We study Visual Question Generation, a task of great interest for grounded dialog, which makes it possible to assess the impact of each modality (since the input can be visual and/or textual). Moreover, the generation aspect of the task requires an adaptation, since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage either mono-modal or multi-modal representations. The results reported under different configurations indicate an innate capacity of BERT-gen to adapt to multi-modal data and text generation, even with little data available, avoiding expensive pre-training. The proposed model obtains substantial improvements over the state of the art on two established VQG datasets.
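To make the idea concrete, below is a minimal, hypothetical sketch of adapting BERT to left-to-right question generation from mixed visual and textual inputs, in the spirit of what the abstract describes. The abstract does not specify the actual BERT-gen architecture; the use of HuggingFace's BertLMHeadModel, the linear projection of image-region features into BERT's embedding space, the feature dimensions, the example caption, and the greedy decoding loop are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: using BERT as a left-to-right generator over
# concatenated visual and textual inputs. NOT the official BERT-gen code.
import torch
from transformers import BertTokenizer, BertLMHeadModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# is_decoder=True gives BERT a causal attention mask so it can be used
# autoregressively instead of as a purely bidirectional encoder.
model = BertLMHeadModel.from_pretrained("bert-base-uncased", is_decoder=True)
model.eval()

# Assumed: visual region features (e.g., from an object detector) projected
# into BERT's hidden size so they can be consumed like token embeddings.
num_regions, feat_dim = 36, 2048
visual_features = torch.randn(1, num_regions, feat_dim)  # placeholder features
project = torch.nn.Linear(feat_dim, model.config.hidden_size)
visual_embeds = project(visual_features)

# Textual modality (e.g., an image caption) embedded with BERT's own embeddings.
text = tokenizer("a man riding a horse on a beach", return_tensors="pt")
text_embeds = model.bert.embeddings.word_embeddings(text.input_ids)

# Concatenate the two modalities and greedily generate a question token by token.
context_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
generated = [tokenizer.cls_token_id]
with torch.no_grad():
    for _ in range(20):
        dec_embeds = model.bert.embeddings.word_embeddings(torch.tensor([generated]))
        out = model(inputs_embeds=torch.cat([context_embeds, dec_embeds], dim=1))
        next_id = out.logits[0, -1].argmax().item()
        if next_id == tokenizer.sep_token_id:
            break
        generated.append(next_id)

print(tokenizer.decode(generated[1:]))
```

Dropping either `visual_embeds` or `text_embeds` from the concatenation gives the mono-modal configurations mentioned in the abstract; how BERT-gen actually fuses or masks the modalities is described in the paper itself.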