Paper Title


Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

Paper Authors

Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang Feng, Jie Zhou, Jiwen Lu

Paper Abstract

Action recognition models have shown a promising capability to classify human actions in short video clips. In real scenarios, multiple correlated human actions commonly occur in particular orders, forming semantically meaningful human activities. Conventional action recognition approaches focus on analyzing single actions. However, they fail to fully reason about the contextual relations between adjacent actions, which provide potential temporal logic for understanding long videos. In this paper, we propose a prompt-based framework, Bridge-Prompt (Br-Prompt), to model the semantics across adjacent actions, so that it simultaneously exploits both out-of-context and contextual information from a series of ordinal actions in instructional videos. More specifically, we reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics. The generated text prompts are paired with corresponding video clips, and together co-train the text encoder and the video encoder via a contrastive approach. The learned vision encoder has a stronger capability for ordinal-action-related downstream tasks, e.g., action segmentation and human activity recognition. We evaluate the performance of our approach on several video datasets: Georgia Tech Egocentric Activities (GTEA), 50Salads, and the Breakfast dataset. Br-Prompt achieves state-of-the-art results on multiple benchmarks. Code is available at https://github.com/ttlmh/Bridge-Prompt.
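The abstract's core idea, turning a sequence of per-step action labels into one ordinal text prompt and co-training encoders contrastively, can be sketched in broad strokes as a CLIP-style symmetric InfoNCE objective over paired (video clip, prompt) embeddings. The prompt template and function names below are illustrative assumptions, not the paper's actual implementation; the authors' exact prompt designs are in the linked repository.

```python
import numpy as np

def build_ordinal_prompt(actions):
    """Join per-step action labels into one ordinal text prompt.

    Hypothetical template; Bridge-Prompt's real prompt designs are
    defined in the paper and its repository.
    """
    ordinals = ["First", "Then", "Finally"]
    parts = []
    for i, action in enumerate(actions):
        word = ordinals[min(i, len(ordinals) - 1)]
        parts.append(f"{word}, the person {action}.")
    return " ".join(parts)

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss.

    Matching (video, prompt) pairs sit on the diagonal of the
    cosine-similarity matrix; the loss pulls them together in both
    the video-to-text and text-to-video directions.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature
    loss_v2t = -np.mean(np.diag(log_softmax(logits)))
    loss_t2v = -np.mean(np.diag(log_softmax(logits.T)))
    return 0.5 * (loss_v2t + loss_t2v)
```

As a sanity check, the loss for correctly aligned video/text embedding batches is lower than for a batch whose text rows have been shuffled, which is what drives the encoders to agree on matching pairs during training.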
