Paper Title
Transformation Driven Visual Reasoning
Paper Authors
Paper Abstract
This paper defines a new visual reasoning paradigm by introducing an important factor, i.e.~transformation. The motivation comes from the fact that most existing visual reasoning tasks, such as CLEVR in VQA, are solely defined to test how well the machine understands the concepts and relations within static settings, such as a single image. We argue that this kind of \textbf{state driven visual reasoning} approach has limitations in reflecting whether the machine has the ability to infer the dynamics between different states, which has been shown to be as important as state-level reasoning for human cognition in Piaget's theory. To tackle this problem, we propose a novel \textbf{transformation driven visual reasoning} task. Given both the initial and final states, the target is to infer the corresponding single-step or multi-step transformation, represented as a triplet (object, attribute, value) or a sequence of triplets, respectively. Following this definition, a new dataset, namely TRANCE, is constructed on the basis of CLEVR, including three levels of settings, i.e.~Basic (single-step transformation), Event (multi-step transformation), and View (multi-step transformation with variant views). Experimental results show that state-of-the-art visual reasoning models perform well on Basic, but are still far from human-level intelligence on Event and View. We believe the proposed new paradigm will boost the development of machine visual reasoning. More advanced methods and real data need to be investigated in this direction. The resources of TVR are available at https://hongxin2019.github.io/TVR.
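To make the triplet representation concrete, the following is a minimal sketch of how a single-step (Basic) and a multi-step (Event/View) answer could be encoded. The class name, field names, and the CLEVR-style attribute vocabulary used here are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transformation:
    # A single-step transformation is a triplet (object, attribute, value):
    # which object changed, which attribute changed, and its new value.
    obj: str        # identifier of the changed object, e.g. "object_2" (assumed naming)
    attribute: str  # e.g. "color", "shape", "size", "material" (CLEVR-style attributes, assumed)
    value: str      # the value the attribute takes in the final state, e.g. "red"

# Basic: a single triplet explains the difference between the initial and final states.
basic_answer = Transformation(obj="object_2", attribute="color", value="red")

# Event / View: a multi-step transformation is a sequence of triplets.
event_answer: List[Transformation] = [
    Transformation(obj="object_0", attribute="size", value="large"),
    Transformation(obj="object_3", attribute="color", value="blue"),
]
```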