Paper title
Attention over learned object embeddings enables complex visual reasoning
Paper authors
Abstract
Neural networks have achieved success in a wide array of perceptual tasks but often fail at tasks involving both perception and higher-level reasoning. On these more challenging tasks, bespoke approaches (such as modular symbolic components, independent dynamics models or semantic parsers) targeted towards that specific type of task have typically performed better. The downside to these targeted approaches, however, is that they can be more brittle than general-purpose neural networks, requiring significant modification or even redesign according to the particular task at hand. Here, we propose a more general neural-network-based approach to dynamic visual reasoning problems that obtains state-of-the-art performance on three different domains, in each case outperforming bespoke modular approaches tailored specifically to the task. Our method relies on learned object-centric representations, self-attention and self-supervised dynamics learning, and all three elements together are required for strong performance to emerge. The success of this combination suggests that there may be no need to trade off flexibility for performance on problems involving spatio-temporal or causal-style reasoning. With the right soft biases and learning objectives in a neural network we may be able to attain the best of both worlds.