Paper title
Learning To Recognize Procedural Activities with Distant Supervision
Paper authors
Paper abstract
In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging the distant supervision of a textual knowledge base (wikiHow) that includes detailed descriptions of the steps needed for the execution of a wide variety of complex activities. Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically-labeled steps (without manual supervision) yield a representation that achieves superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting and egocentric video classification.
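The matching step described in the abstract, assigning each transcribed speech segment to its closest step description in the knowledge base, can be illustrated with a toy sketch. This is not the paper's implementation: a bag-of-words cosine similarity stands in for the language model, and the names `embed`, `cosine`, and `assign_steps`, as well as the example steps and transcript segments, are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a language-model embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_steps(transcript_segments, step_descriptions):
    """Return (best_step_index, similarity) for every transcript segment.

    The step index acts as an automatically derived pseudo-label; no manual
    annotation of the video is involved.
    """
    step_vecs = [embed(s) for s in step_descriptions]
    labels = []
    for segment in transcript_segments:
        seg_vec = embed(segment)
        scores = [cosine(seg_vec, v) for v in step_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append((best, scores[best]))
    return labels


# Hypothetical step descriptions (knowledge-base side) and noisy speech
# segments (video side), invented for this example.
steps = [
    "Crack the eggs into a bowl",
    "Whisk the eggs with milk",
    "Pour the mixture into a hot pan",
]
segments = [
    "so first crack two eggs into your bowl",
    "now pour the mixture into the pan",
]

pseudo_labels = assign_steps(segments, steps)
for segment, (idx, score) in zip(segments, pseudo_labels):
    print(f"{segment!r} -> step {idx}: {steps[idx]!r} (similarity {score:.2f})")
```

In this sketch the resulting step indices play the role of the automatically generated labels on which a video model could then be trained. A real system would replace `embed` with sentence embeddings from a pretrained language model and would have to cope with far noisier transcripts.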