Paper Title
SpatioTemporal Focus for Skeleton-based Action Recognition
Paper Authors
Paper Abstract
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition due to their powerful ability to model data topology. We argue that the performance of recently proposed skeleton-based action recognition methods is limited by the following factors. First, the predefined graph structure is shared throughout the network, lacking the flexibility and capacity to model multi-grain semantic information. Second, the relations among distant joints are not fully exploited by local graph convolution, which may lose implicit joint relevance. For instance, actions such as running and waving are performed by the co-movement of body parts and joints, e.g., legs and arms, yet these are far apart in the physical skeleton. Inspired by recent attention mechanisms, we propose a multi-grain contextual focus module, termed MCF, to capture action-related relational information from body joints and parts. As a result, MCF yields more explainable representations for different skeleton action sequences. In this study, we follow the common practice of densely sampling the input skeleton sequences, which introduces considerable redundancy since many frames are irrelevant to the action. To reduce this redundancy, a temporal discrimination focus module, termed TDF, is developed to capture the locally sensitive points of the temporal dynamics. MCF and TDF are integrated into a standard GCN backbone to form a unified architecture named STF-Net. Based on multi-grain context aggregation and temporal dependency, STF-Net is able to capture robust movement patterns from skeleton topology structures. Extensive experimental results show that our STF-Net achieves state-of-the-art results on three challenging benchmarks: NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton.
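The abstract describes two attention-style re-weighting modules: a spatial focus over joints/parts (MCF) and a temporal focus over frames (TDF). The paper's actual layer designs are not given here, so the following is only a minimal numpy sketch of the general idea under assumed shapes: a skeleton sequence tensor of shape (T, V, C) (frames, joints, channels), joint weights derived from feature magnitude, and frame weights derived from inter-frame motion. All function names and scoring rules below are illustrative stand-ins, not the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_focus(feats):
    """Toy stand-in for MCF: weight each joint by the softmax of its
    mean feature magnitude over time, so informative joints dominate."""
    # feats: (T, V, C) -> per-joint score: (V,)
    score = np.linalg.norm(feats, axis=-1).mean(axis=0)
    w = softmax(score)                    # joint attention weights, sum to 1
    return feats * w[None, :, None]

def temporal_focus(feats):
    """Toy stand-in for TDF: emphasize frames with large motion
    (difference to the previous frame), suppressing redundant frames."""
    # per-frame motion magnitude: (T-1,), pad frame 0 with zero motion
    motion = np.linalg.norm(np.diff(feats, axis=0), axis=(1, 2))
    score = np.concatenate([[0.0], motion])
    w = softmax(score)                    # frame attention weights, sum to 1
    return feats * w[:, None, None]

# toy sequence: 8 frames, 25 joints (NTU skeleton layout), 3D coordinates
x = np.random.default_rng(0).normal(size=(8, 25, 3))
y = temporal_focus(spatial_focus(x))
```

In a real STF-Net-style model these weights would be produced by learned layers inside a GCN backbone rather than by fixed magnitude/motion heuristics; the sketch only shows the re-weighting pattern (score, softmax, broadcast multiply) that such focus modules share.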