Paper Title
Scene-aware Egocentric 3D Human Pose Estimation
Paper Authors
Paper Abstract
Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality. Existing methods still struggle in challenging poses where the human body is highly occluded or is closely interacting with the scene. To address this issue, we propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints. To this end, we propose an egocentric depth estimation network to predict the scene depth map from a wide-view egocentric fisheye camera while mitigating the occlusion of the human body with a depth-inpainting network. Next, we propose a scene-aware pose estimation network that projects the 2D image features and estimated depth map of the scene into a voxel space and regresses the 3D pose with a V2V network. The voxel-based feature representation provides a direct geometric connection between 2D image features and scene geometry, and further enables the V2V network to constrain the predicted pose based on the estimated scene geometry. To enable the training of the aforementioned networks, we also generate a synthetic dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called EgoPW-Scene. Experimental results on our new evaluation sequences show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, demonstrating that our method outperforms the state-of-the-art methods both quantitatively and qualitatively.
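To illustrate the voxel-based fusion step described above, the following is a minimal, hedged sketch (not the authors' code): 2D image features and the estimated scene depth map are lifted into a shared voxel volume, and a small V2V-style 3D CNN regresses per-joint heatmaps. All names (`lift_to_voxels`, `TinyV2V`), tensor shapes, the precomputed projection inputs, the occupancy threshold, and the layer sizes are assumptions made for clarity, not details taken from the paper.

```python
# Illustrative sketch of lifting 2D features + scene depth into a voxel grid
# and regressing joint heatmaps with a small 3D CNN. Shapes and the projection
# inputs (proj_xy, proj_depth) are assumed to be precomputed from the fisheye
# camera model; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def lift_to_voxels(feat_2d, scene_depth, proj_xy, proj_depth):
    """Lift 2D image features and an estimated scene depth map into a voxel volume.

    feat_2d:     (B, C, H, W)       image features
    scene_depth: (B, 1, H, W)       estimated scene depth map
    proj_xy:     (B, D, Hv, Wv, 2)  normalized image coords of each voxel center
    proj_depth:  (B, D, Hv, Wv)     camera-space depth of each voxel center
    Returns a (B, C+1, D, Hv, Wv) volume: sampled features plus a coarse
    scene-occupancy channel (1 where a voxel lies near the estimated surface).
    """
    B, C, H, W = feat_2d.shape
    D, Hv, Wv = proj_xy.shape[1:4]

    # Sample image features at the projected voxel locations.
    grid = proj_xy.view(B, D * Hv, Wv, 2)
    feat_vox = F.grid_sample(feat_2d, grid, align_corners=False)
    feat_vox = feat_vox.view(B, C, D, Hv, Wv)

    # Sample the scene depth at the same locations and flag voxels whose
    # camera-space depth agrees with it (0.1 m tolerance is an assumption).
    depth_at_vox = F.grid_sample(scene_depth, grid, align_corners=False)
    depth_at_vox = depth_at_vox.view(B, 1, D, Hv, Wv)
    occupancy = (torch.abs(proj_depth.unsqueeze(1) - depth_at_vox) < 0.1).float()

    return torch.cat([feat_vox, occupancy], dim=1)


class TinyV2V(nn.Module):
    """A drastically reduced stand-in for a V2V-style volumetric network."""

    def __init__(self, in_ch, num_joints):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, num_joints, 1),
        )

    def forward(self, vox):
        # (B, J, D, Hv, Wv) per-joint heatmaps; joint positions follow via soft-argmax.
        return self.net(vox)
```

The point of the sketch is the extra occupancy channel: because image features and scene geometry live in the same voxel grid, the 3D network can learn to keep predicted joints consistent with the estimated scene surface.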