Paper Title
Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition
Paper Authors
Paper Abstract
Humans can easily recognize actions from only a few examples, while existing video recognition models still rely heavily on large-scale labeled data. This observation has motivated an increasing interest in few-shot video action recognition, which aims to learn new actions from only very few labeled samples. In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net. Concretely, we tackle the few-shot recognition problem from three aspects: first, we alleviate the extreme data scarcity by introducing depth information as a carrier of the scene, which brings extra visual information to our model; second, we fuse the representation of the original RGB clip with multiple non-strictly corresponding depth clips sampled by our temporal asynchronization augmentation mechanism, which synthesizes new instances at the feature level; third, a novel Depth Guided Adaptive Instance Normalization (DGAdaIN) fusion module is proposed to fuse the two-stream modalities efficiently. Additionally, to better mimic the few-shot recognition process, our model is trained in a meta-learning manner. Extensive experiments on several action recognition benchmarks demonstrate the effectiveness of our model.
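For readers who want a concrete picture of the two mechanisms named in the abstract, below is a minimal sketch, assuming PyTorch-style clip features, of (i) a DGAdaIN-style module in which depth features predict the affine parameters of an adaptive instance normalization applied to RGB features, and (ii) a temporal asynchronization sampler that pairs an RGB clip with a depth clip drawn from a shifted temporal position. All names, tensor shapes, and hyper-parameters (feat_dim, clip_len, max_offset, etc.) are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch of a DGAdaIN-style fusion module and a temporal
# asynchronization sampler. Shapes and hyper-parameters are assumptions
# made for illustration, not the paper's official code.
import random
import torch
import torch.nn as nn


class DGAdaIN(nn.Module):
    """Depth-guided adaptive instance normalization (sketch).

    The depth feature predicts per-channel scale and shift that modulate
    the instance-normalized RGB feature.
    """

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Instance norm without learnable affine parameters; the affine
        # transform is generated from the depth branch instead.
        self.inst_norm = nn.InstanceNorm1d(feat_dim, affine=False)
        self.to_scale = nn.Linear(feat_dim, feat_dim)
        self.to_shift = nn.Linear(feat_dim, feat_dim)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat:   (B, C, T) frame-level RGB features of a clip
        # depth_feat: (B, C, T) frame-level depth features of a clip
        depth_ctx = depth_feat.mean(dim=-1)              # (B, C) clip-level depth context
        gamma = self.to_scale(depth_ctx).unsqueeze(-1)   # (B, C, 1) per-channel scale
        beta = self.to_shift(depth_ctx).unsqueeze(-1)    # (B, C, 1) per-channel shift
        normed = self.inst_norm(rgb_feat)                # normalize each channel over time
        return gamma * normed + beta


def sample_asynchronous_pair(num_frames: int, clip_len: int = 8, max_offset: int = 4):
    """Return start indices for an RGB clip and a temporally shifted depth clip."""
    rgb_start = random.randint(0, num_frames - clip_len)
    offset = random.randint(-max_offset, max_offset)
    depth_start = min(max(rgb_start + offset, 0), num_frames - clip_len)
    return rgb_start, depth_start


if __name__ == "__main__":
    fusion = DGAdaIN(feat_dim=512)
    rgb = torch.randn(4, 512, 8)     # pretend RGB clip features
    depth = torch.randn(4, 512, 8)   # pretend depth clip features
    fused = fusion(rgb, depth)
    print(fused.shape, sample_asynchronous_pair(num_frames=64))

Under this reading, the depth stream acts purely as a conditioning signal: it only produces per-channel scale and shift, so the fused feature keeps the dimensionality of the RGB feature and can feed a standard metric-learning head used in meta-training.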