Paper Title
Condensing a Sequence to One Informative Frame for Video Recognition
Paper Authors
Paper Abstract
Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses a video sequence into one informative "frame" and then exploits an off-the-shelf image recognition system on the synthetic frame. A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that incorporates three objective tasks, i.e., appearance reconstruction, video categorization, and motion estimation, and two regularizers, i.e., adversarial learning and color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. By jointly learning frame synthesis in an end-to-end manner, the generated frame is expected to encapsulate the spatio-temporal information required for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. Compared to baseline methods that map a video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently demonstrates evident improvements on both image-based 2D networks and clip-based 3D networks, and achieves performance comparable to state-of-the-art methods at lower computational cost.
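The abstract describes IFS training as jointly optimizing three task objectives and two regularizers. The paper does not give the exact loss formulation here, so the following is only a minimal illustrative sketch of how such a multi-objective training loss is typically combined; the function name, the individual loss terms, and all weight values are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch of combining the IFS objectives into one training loss.
# Weight values (w_*) and the loss-term names are illustrative assumptions.

def ifs_total_loss(l_app, l_cls, l_motion, l_adv, l_color,
                   w_app=1.0, w_cls=1.0, w_motion=1.0,
                   w_adv=0.1, w_color=0.1):
    """Weighted sum of the three task losses (appearance reconstruction,
    video categorization, motion estimation) and the two regularizers
    (adversarial learning, color consistency)."""
    return (w_app * l_app + w_cls * l_cls + w_motion * l_motion
            + w_adv * l_adv + w_color * l_color)

# Example: combine illustrative per-term loss values into one scalar
# that a single end-to-end optimizer step would minimize.
total = ifs_total_loss(l_app=0.5, l_cls=1.2, l_motion=0.8,
                       l_adv=0.3, l_color=0.2)
print(total)  # approximately 2.55
```

In this formulation, the regularizer weights are kept smaller than the task weights so that visual-quality terms shape the synthetic frame without dominating the recognition-oriented objectives.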