Paper Title

ViTPose++: Vision Transformer for Generic Body Pose Estimation

Paper Authors

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

Paper Abstract

In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Moreover, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Building on this flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark in both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
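To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch, not the official implementation: a plain, non-hierarchical ViT encoder whose feed-forward layers are factorized into a task-agnostic branch plus per-task experts (the ViTPose+ knowledge-factorization idea), followed by a lightweight deconvolutional decoder that regresses per-keypoint heatmaps. The class names (`ViTPoseSketch`, `FactorizedFFN`), the 50/50 shared/expert hidden split, the two-deconv decoder, and all layer sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FactorizedFFN(nn.Module):
    """Task-agnostic + task-specific feed-forward split. The 50/50
    hidden split is an illustrative assumption."""
    def __init__(self, dim, hidden, num_tasks):
        super().__init__()
        shared, expert = hidden // 2, hidden - hidden // 2
        self.shared = nn.Sequential(nn.Linear(dim, shared), nn.GELU(),
                                    nn.Linear(shared, dim))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, expert), nn.GELU(),
                          nn.Linear(expert, dim))
            for _ in range(num_tasks))

    def forward(self, x, task_id):
        # Sum of the task-agnostic path and the chosen task-specific expert.
        return self.shared(x) + self.experts[task_id](x)

class Block(nn.Module):
    """One plain (non-hierarchical) pre-norm transformer block."""
    def __init__(self, dim, heads, hidden, num_tasks):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = FactorizedFFN(dim, hidden, num_tasks)

    def forward(self, x, task_id):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x), task_id)

class ViTPoseSketch(nn.Module):
    """Plain ViT encoder + lightweight heatmap decoder (ViT-Base-like
    sizes, chosen here for illustration)."""
    def __init__(self, img_size=(256, 192), patch=16, dim=768, depth=12,
                 heads=12, num_tasks=3, num_keypoints=17):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.h, self.w = img_size[0] // patch, img_size[1] // patch
        self.pos = nn.Parameter(torch.zeros(1, self.h * self.w, dim))
        self.blocks = nn.ModuleList(
            Block(dim, heads, 4 * dim, num_tasks) for _ in range(depth))
        # Lightweight decoder: two deconvs upsample the token map 4x,
        # then a 1x1 conv produces one heatmap per keypoint.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 256, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_keypoints, 1))

    def forward(self, img, task_id=0):
        x = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos
        for blk in self.blocks:
            x = blk(x, task_id)
        feat = x.transpose(1, 2).reshape(x.size(0), -1, self.h, self.w)
        return self.decoder(feat)  # (B, num_keypoints, H/4, W/4)

heatmaps = ViTPoseSketch()(torch.randn(1, 3, 256, 192), task_id=0)
print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])
```

In this sketch, routing a batch with a different `task_id` (e.g., human vs. animal keypoints) reuses the same attention and shared FFN weights while switching only the task-specific expert, which is why the multi-task model can cover heterogeneous keypoint categories without slowing down inference for any single task.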
