Paper Title

We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos

Paper Authors

Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris, Carl Vondrick, Aude Oliva

Paper Abstract

Identifying common patterns among events is a key ability in human and machine perception, as it underlies intelligent decision making. We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos. This allows our model to perform cognitive tasks such as set abstraction (which general concept is in common among a set of videos?), set completion (which new video goes well with the set?), and odd one out detection (which video does not belong to the set?). Experiments on two video benchmarks, Kinetics and Multi-Moments in Time, show that robust and versatile representations emerge when learning to recognize commonalities among sets. We compare our model to several baseline algorithms and show that significant improvements result from explicitly learning relational abstractions with semantic supervision.
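
To make the set-abstraction idea concrete, below is a minimal sketch in PyTorch of how such a model could be structured, assuming pre-extracted per-video feature vectors. The `SetAbstraction` class, its `set_proj` and `video_proj` layers, and the `odd_one_out` helper are hypothetical names for illustration only; the authors' actual architecture, training objective, and natural language supervision are not reproduced here.

```python
# A minimal sketch (not the authors' code) of the set-abstraction idea:
# each video is a feature vector, the "set abstraction" is a learned
# projection of the pooled set features, and candidates are scored by
# similarity to that abstraction. Feature extraction and the semantic
# (language) supervision described in the abstract are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SetAbstraction(nn.Module):
    def __init__(self, feat_dim: int = 512, abstract_dim: int = 256):
        super().__init__()
        # Projects the pooled set features into a shared "abstraction" space.
        self.set_proj = nn.Linear(feat_dim, abstract_dim)
        # Projects individual video features into the same space.
        self.video_proj = nn.Linear(feat_dim, abstract_dim)

    def forward(self, set_feats: torch.Tensor, candidate_feats: torch.Tensor):
        # set_feats: (num_set_videos, feat_dim)
        # candidate_feats: (num_candidates, feat_dim)
        abstraction = self.set_proj(set_feats.mean(dim=0))  # pooled set representation
        candidates = self.video_proj(candidate_feats)
        # Cosine similarity between each candidate and the set abstraction.
        scores = F.cosine_similarity(candidates, abstraction.unsqueeze(0), dim=-1)
        return abstraction, scores

model = SetAbstraction()
video_set = torch.randn(4, 512)    # stand-in for extracted video features
candidates = torch.randn(5, 512)
_, scores = model(video_set, candidates)
print("set completion pick:", scores.argmax().item())  # most compatible candidate

# Odd-one-out within a set: score each member against the abstraction
# of the remaining members; the least compatible member is the outlier.
def odd_one_out(model: SetAbstraction, feats: torch.Tensor) -> int:
    scores = []
    for i in range(feats.size(0)):
        rest = torch.cat([feats[:i], feats[i + 1:]], dim=0)
        _, s = model(rest, feats[i:i + 1])
        scores.append(s.item())
    return int(torch.tensor(scores).argmin())

print("odd one out:", odd_one_out(model, video_set))
```

The design choice here, pooling member features before projecting, mirrors the abstract's notion of a single high-level representation of what a set of videos has in common; compatibility with that representation then drives both set completion (highest score) and odd-one-out detection (lowest score).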
