Paper Title
Efficient Human Pose Estimation by Learning Deeply Aggregated Representations
Paper Authors
Paper Abstract
In this paper, we propose an efficient human pose estimation network (DANet) that learns deeply aggregated representations. Most existing models explore multi-scale information mainly from features with different spatial sizes. Powerful multi-scale representations usually rely on a cascaded pyramid framework, which largely boosts performance but at the same time makes networks very deep and complex. Instead, we focus on exploiting multi-scale information from layers with different receptive-field sizes, and then making full use of this information through an improved fusion method. Specifically, we propose an orthogonal attention block (OAB) and a second-order fusion unit (SFU). The OAB learns multi-scale information from different layers and enhances it by encouraging the layers to be diverse. The SFU adaptively selects and fuses diverse multi-scale information and suppresses redundant information, which maximizes the effective information in the final fused representations. With the help of the OAB and SFU, our single-pyramid network may be able to generate deeply aggregated representations that contain even richer multi-scale information and have a larger representational capacity than those of cascaded networks. Thus, our networks can achieve comparable or even better accuracy with much smaller model complexity. Specifically, our \mbox{DANet-72} achieves an AP score of $70.5$ on the COCO test-dev set with only $1.0G$ FLOPs, and its speed on a CPU platform reaches $58$ Persons-Per-Second~(PPS).
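The adaptive select-and-fuse idea behind the SFU can be illustrated with a minimal NumPy sketch. This is not the paper's actual SFU: the branch shapes, the per-channel scoring, and the softmax weighting are all assumptions made here for illustration. It simply shows how per-branch attention weights can emphasize informative scales and down-weight redundant ones when combining features from layers with different receptive-field sizes.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_multiscale(branches, scores):
    """Attention-weighted fusion of multi-branch features (illustrative only).

    branches: list of (C, H, W) feature maps, one per receptive-field size.
    scores:   (num_branches, C) logits; a softmax over the branch axis
              yields per-channel weights that select informative scales
              and suppress redundant ones.
    Returns a fused (C, H, W) feature map.
    """
    stack = np.stack(branches)                 # (B, C, H, W)
    w = softmax(scores, axis=0)                # (B, C), sums to 1 over branches
    return (w[:, :, None, None] * stack).sum(axis=0)

rng = np.random.default_rng(0)
feats = [rng.standard_normal((16, 8, 8)) for _ in range(3)]  # 3 hypothetical branches
logits = rng.standard_normal((3, 16))
fused = fuse_multiscale(feats, logits)
print(fused.shape)  # (16, 8, 8)
```

In a real network the scores would themselves be predicted from the features (e.g. by global pooling followed by a small learned layer), so the selection adapts per input rather than being fixed.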