Paper Title
Joint Answering and Explanation for Visual Commonsense Reasoning
Authors
Abstract
Visual Commonsense Reasoning (VCR), regarded as a challenging extension of Visual Question Answering (VQA), pursues higher-level visual comprehension. It comprises two indispensable processes: question answering over a given image and rationale inference for answer explanation. Over the years, a variety of methods tackling VCR have advanced performance on the benchmark dataset. Significant as these methods are, they often treat the two processes separately and hence decompose VCR into two unrelated VQA instances. As a result, the pivotal connection between question answering and rationale inference is broken, rendering existing efforts less faithful in visual reasoning. To study this issue empirically, we perform in-depth explorations of both language shortcuts and generalization capability to verify the pitfalls of this treatment. Based on our findings, we present a plug-and-play, knowledge-distillation-enhanced framework that couples the question answering and rationale inference processes. The key contribution is a novel branch that serves as a bridge connecting the two processes. Because our framework is model-agnostic, we apply it to existing popular baselines and validate its effectiveness on the benchmark dataset. As detailed in the experimental results, these baselines achieve consistent and significant performance improvements when equipped with our framework, demonstrating the viability of coupling the processes as well as the superiority of the proposed framework.