Paper Title

Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

Paper Authors

Tiong, Anthony Meng Huat, Li, Junnan, Li, Boyang, Savarese, Silvio, Hoi, Steven C. H.

Paper Abstract

Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves an improvement of 9.1% on GQA over FewVLM with 740M PLM parameters. Code is released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa
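As a rough illustration of the two-stage pipeline the abstract describes (question-guided captioning, then question answering with the captions as PLM context), here is a minimal Python sketch. The `caption_image` and `answer_with_plm` functions are hypothetical placeholders standing in for the paper's actual pretrained captioning model and PLM, not the real PNP-VQA implementation (see the linked LAVIS repository for that):

```python
# Minimal sketch of the PNP-VQA pipeline structure described in the abstract.
# Both model calls below are hypothetical stand-ins; in the real system they
# would be pretrained models used as-is, with no additional training.

def caption_image(image, question):
    """Placeholder: a captioning model would return question-guided,
    informative captions for the image here."""
    return ["a red apple on a wooden table", "an apple next to a knife"]

def answer_with_plm(question, context):
    """Placeholder: a pretrained language model would be prompted with the
    captions as context plus the question, and return an answer."""
    return "apple" if "apple" in context else "unknown"

def pnp_vqa(image, question):
    # Step 1: generate question-guided informative image captions.
    captions = caption_image(image, question)
    # Step 2: pass the captions to the PLM as context for question answering.
    context = " ".join(captions)
    return answer_with_plm(question, context)

print(pnp_vqa(image=None, question="What fruit is on the table?"))
```

The key design point the abstract emphasizes is that natural language (the captions) serves as the interface between the two frozen models, so neither model needs to be retrained for the other's modality.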
