Paper Title
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
Paper Authors
Paper Abstract
To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each inspired-cognitive task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
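The modality weighting mechanism described above can be illustrated with a minimal sketch: each branch (dialog, scene description, external knowledge) contributes an answer representation, and a softmax over per-branch relevance scores balances their contributions. This is an assumption-laden toy in plain Python, not the paper's implementation; the function names, vector sizes, and the use of scalar scores are all illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scalar scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def modality_weighted_fusion(branch_vectors, branch_scores):
    """Fuse per-branch representations with softmax weights.

    branch_vectors: one vector per branch, e.g. from the dialog,
        scene, and knowledge Transformers (hypothetical encodings).
    branch_scores: one relevance score per branch; in the paper these
        would come from a learned scoring layer, here plain numbers.
    Returns the weighted sum of the vectors and the weights used.
    """
    weights = softmax(branch_scores)
    dim = len(branch_vectors[0])
    fused = [0.0] * dim
    for w, vec in zip(weights, branch_vectors):
        for i, v in enumerate(vec):
            fused[i] += w * v
    return fused, weights

# Toy usage: three 2-dim branch encodings with equal scores.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = modality_weighted_fusion(vecs, [0.0, 0.0, 0.0])
```

With equal scores every branch receives weight 1/3, so the fused vector is simply the average; in the trained model the scores would shift weight toward whichever source (dialog, scene, or knowledge) is most informative for the given question.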