Barlow constrained optimization for Visual Question Answering

Paper title

Barlow constrained optimization for Visual Question Answering

Authors

Abhishek Jha, Badri N. Patro, Luc Van Gool, Tinne Tuytelaars

Abstract

Visual question answering is a vision-and-language multimodal task that aims to predict answers given samples from the question and image modalities. Most recent methods focus on learning a good joint embedding space of images and questions, either by improving the interaction between these two modalities or by making the space more discriminative. However, how informative this joint space is has not been well explored. In this paper, we propose a novel regularization for VQA models, Constrained Optimization using Barlow's theory (COB), that improves the information content of the joint space by minimizing redundancy. It reduces the correlation between the learned feature components and thereby disentangles semantic concepts. Our model also aligns the joint space with the answer embedding space, where we consider the answer and image+question as two different "views" of what is in essence the same semantic information. We propose a constrained optimization policy to balance the categorical and redundancy minimization forces. When built on the state-of-the-art GGE model, the resulting model improves VQA accuracy by 1.4% and 4% on the VQA-CP v2 and VQA v2 datasets, respectively. The model also exhibits better interpretability.
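
The redundancy-minimization idea in the abstract can be illustrated with a generic Barlow Twins-style loss between two views of the same batch. This is a minimal sketch under stated assumptions, not the COB formulation from the paper: the function name `barlow_redundancy_loss`, the `off_diag_weight` value, and the use of NumPy are all illustrative choices, and the constrained optimization policy that balances this term against the classification loss is not shown.

```python
import numpy as np


def barlow_redundancy_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-8):
    """Generic Barlow Twins-style loss (illustrative, not the paper's exact loss).

    z_a, z_b: arrays of shape (batch, dim), e.g. a joint image+question
    embedding and an answer embedding for the same samples (assumed pairing).
    """
    # Standardize every feature dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)

    # Cross-correlation matrix between the two views, shape (dim, dim).
    n = z_a.shape[0]
    c = z_a.T @ z_b / n

    # Diagonal entries should approach 1: the two views agree per component.
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)
    # Off-diagonal entries should approach 0: components are decorrelated.
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)

    return on_diag + off_diag_weight * off_diag
```

Under these assumptions, embeddings whose components all carry the same signal are penalized more heavily than embeddings with decorrelated components, which is the sense in which the regularizer reduces redundancy.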
