Paper Title

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

Paper Authors

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

Paper Abstract

Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. In order to promote the collection of non-trivial challenging examples, we employ an adversarial human-and-model-in-the-loop data collection procedure. We also present a strong baseline incorporating information from video, dialogue, and commonsense knowledge. Experiments show that each type of information is useful for this challenging task, and that compared to the high human performance on VLEP, our model provides a good starting point but leaves large room for future work. Our dataset and code are available at: https://github.com/jayleicn/VideoLanguageFuturePred
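
For concreteness, the task can be framed as binary multiple-choice: given a premise (a video clip plus its aligned dialogue), a model scores two candidate future events and selects the more likely one. Below is a minimal Python sketch of that framing. The field names (vid_name, dialogue, events, answer) and the score_fn interface are assumptions based on the abstract's task description, not the released repository's exact schema or the paper's baseline implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical example structure; field names are assumptions based on
# the paper's task description, not the released VLEP schema.
@dataclass
class VLEPExample:
    vid_name: str          # identifier of the premise video clip
    dialogue: str          # dialogue/subtitles aligned with the clip
    events: List[str]      # two candidate future events
    answer: int            # index (0 or 1) of the more likely event

def predict(example: VLEPExample,
            score_fn: Callable[[Tuple[str, str], str], float]) -> int:
    """Score each candidate event against the premise and pick the max.

    `score_fn(premise, event) -> float` is a placeholder for a model
    that fuses video, dialogue, and commonsense knowledge, in the
    spirit of the paper's baseline.
    """
    premise = (example.vid_name, example.dialogue)
    scores = [score_fn(premise, ev) for ev in example.events]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy usage with a trivial word-overlap scorer (illustration only;
# the clip name and dialogue below are invented).
if __name__ == "__main__":
    ex = VLEPExample(
        vid_name="example_clip_042",
        dialogue="Ross: I'm going to grab a coffee. Anyone want one?",
        events=["Ross walks over to the coffee machine.",
                "Ross starts juggling the chairs."],
        answer=0,
    )
    overlap = lambda premise, event: len(
        set(premise[1].lower().split()) & set(event.lower().split()))
    print("predicted:", predict(ex, overlap), "gold:", ex.answer)
```

A real system would replace the toy scorer with a learned multimodal model; the point of the sketch is only the two-candidate prediction interface the dataset implies.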
