Paper Title

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Authors

Humam Alwassel, Silvio Giancola, Bernard Ghanem

Abstract

Due to the large memory footprint of untrimmed videos, current state-of-the-art video localization methods operate atop precomputed video clip features. These features are extracted from video encoders typically trained for trimmed action classification tasks, making such features not necessarily suitable for temporal localization. In this work, we propose a novel supervised pretraining paradigm for clip features that not only trains to classify activities but also considers background clips and global video information to improve temporal sensitivity. Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning. We also show that our pretraining approach is effective across three encoder architectures and two pretraining datasets. We believe video feature encoding is an important building block for localization algorithms, and extracting temporally-sensitive features should be of paramount importance in building more accurate models. The code and pretrained models are available on our project website.
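The abstract describes the pretraining objective only at a high level: the encoder is trained to classify activities while also labeling background clips, using global video information to gain temporal sensitivity. Below is a minimal PyTorch sketch of what such a two-head objective could look like. The stub encoder, feature dimensions, the names `TSPModel` and `tsp_loss`, and the max-pooled global video feature are illustrative assumptions, not the authors' released implementation (that code is on the project website).

```python
# A minimal sketch of the TSP pretraining idea, assuming a clip-level
# encoder and a max-pooled global video feature (GVF). Details here are
# illustrative, not the authors' exact implementation.
import torch
import torch.nn as nn

class TSPModel(nn.Module):
    def __init__(self, feat_dim=512, num_actions=200):
        super().__init__()
        # Any clip-level video encoder works; here a stub mapping
        # (B*N, C, T, H, W) clips to (B*N, feat_dim) features.
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        # Head 1: action classification, supervised only on foreground clips.
        self.action_head = nn.Linear(feat_dim, num_actions)
        # Head 2: temporal region (foreground vs. background) classification,
        # conditioned on global video context via the concatenated GVF.
        self.region_head = nn.Linear(2 * feat_dim, 2)

    def forward(self, clips):
        # clips: (B, N, C, T, H, W) -- N clips sampled from the same video.
        b, n = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1)).view(b, n, -1)  # (B, N, D)
        # Global Video Feature: max-pool clip features over the whole video.
        gvf = feats.max(dim=1).values                             # (B, D)
        action_logits = self.action_head(feats)                   # (B, N, A)
        region_logits = self.region_head(
            torch.cat([feats, gvf.unsqueeze(1).expand_as(feats)], dim=-1))
        return action_logits, region_logits

def tsp_loss(action_logits, region_logits, action_labels, is_foreground):
    # Foreground/background supervision applies to every clip...
    region_loss = nn.functional.cross_entropy(
        region_logits.flatten(0, 1), is_foreground.flatten().long())
    # ...but action supervision only where a clip overlaps an annotated action.
    fg = is_foreground.flatten().bool()
    action_loss = nn.functional.cross_entropy(
        action_logits.flatten(0, 1)[fg], action_labels.flatten()[fg])
    return action_loss + region_loss
```

After pretraining, only the encoder would be kept: its clip features are precomputed once per untrimmed video and fed to downstream temporal localization, proposal generation, or dense captioning models, as the abstract describes.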
