Paper Title
Where, What, Whether: Multi-modal Learning Meets Pedestrian Detection
Paper Authors
Paper Abstract
Pedestrian detection benefits greatly from deep convolutional neural networks (CNNs). However, it is inherently hard for CNNs to handle situations in the presence of occlusion and scale variation. In this paper, we propose W$^3$Net, which attempts to address the above challenges by decomposing the pedestrian detection task into \textbf{\textit{W}}here, \textbf{\textit{W}}hat and \textbf{\textit{W}}hether problems, corresponding to pedestrian localization, scale prediction and classification respectively. Specifically, for a pedestrian instance, we formulate its feature in three steps. i) We generate a bird-view map, which is naturally free from occlusion issues, and scan all points on it to look for suitable locations for each pedestrian instance. ii) Instead of utilizing pre-fixed anchors, we model the interdependency between depth and scale, aiming to generate depth-guided scales at different locations that better match instances of different sizes. iii) We learn a latent vector shared by both visual and corpus space, by which false positives with similar vertical structure but lacking human part features are filtered out. We achieve state-of-the-art results on widely used datasets (CityPersons and Caltech). In particular, when evaluated on the heavy occlusion subset, our results reduce MR$^{-2}$ from 49.3$\%$ to 18.7$\%$ on CityPersons, and from 45.18$\%$ to 28.33$\%$ on Caltech.
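The depth-to-scale interdependency in step ii) can be illustrated with a pinhole-camera sketch: the expected pixel height of a pedestrian is inversely proportional to its depth. This is a minimal illustration only; the focal length and average pedestrian height below are assumed placeholder values (the focal length roughly follows the Cityscapes camera), not parameters taken from the paper.

```python
# Hypothetical sketch of depth-guided scale generation (step ii).
# Assumes a simple pinhole camera model: h_px = f * H / z, where
# f is the focal length in pixels, H the assumed real-world
# pedestrian height, and z the estimated depth in meters.

def depth_guided_scale(depth_m, focal_px=2262.0, avg_height_m=1.7):
    """Map an estimated depth (meters) to an expected pedestrian
    height in pixels under the pinhole model."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * avg_height_m / depth_m

# A nearby pedestrian (10 m) spans far more pixels than a distant
# one (50 m), so scales generated per location from depth can match
# instances of different sizes better than a fixed anchor set.
near_scale = depth_guided_scale(10.0)
far_scale = depth_guided_scale(50.0)
```

Here a detector would read the depth at each bird-view location and propose a box of the corresponding pixel height, rather than tiling the same pre-fixed anchor sizes everywhere.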