Paper Title
Learning Audio-Visual embedding for Person Verification in the Wild
Paper Authors
Paper Abstract
It has been observed that audio-visual embeddings are more robust than uni-modal embeddings for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. We then find that a strong correlation exists between the modalities during pooling, and therefore propose joint attentive pooling, which incorporates cycle consistency to learn implicit inter-frame weights. Finally, the two modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification.
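To make the two aggregation ideas in the abstract concrete, the sketch below shows a generic attentive statistics pooling layer over frame-level features and a gated fusion of audio and face embeddings. This is a minimal sketch assuming PyTorch and hypothetical layer sizes; it does not reproduce the paper's weight-enhanced or joint (cycle-consistent) formulations, only the standard building blocks they extend.

```python
# Minimal sketch (not the paper's exact formulation): attentive statistics
# pooling over frame-level features and a gated fusion of two modalities.
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Pool frame-level features (B, T, D) into a single vector using an
    attention-weighted mean and standard deviation."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.attention(x), dim=1)          # (B, T, 1) frame weights
        mean = (w * x).sum(dim=1)                            # weighted mean over frames
        var = (w * (x - mean.unsqueeze(1)) ** 2).sum(dim=1)  # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=-1)                # (B, 2D) statistics vector


class GatedFusion(nn.Module):
    """Fuse audio and face embeddings with a learned sigmoid gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, face: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([audio, face], dim=-1))      # per-dimension gate in (0, 1)
        return g * audio + (1 - g) * face                    # gated audio-visual embedding


if __name__ == "__main__":
    frames_a = torch.randn(4, 50, 256)   # audio frame features (hypothetical sizes)
    frames_v = torch.randn(4, 50, 256)   # face frame features
    pool = AttentiveStatsPooling(256)
    fuse = GatedFusion(512)              # pooled vectors are 2 * 256 = 512-dim
    emb = fuse(pool(frames_a), pool(frames_v))
    print(emb.shape)                      # torch.Size([4, 512])
```

The gate lets the model down-weight a degraded modality (e.g., a noisy audio track or an occluded face) per embedding dimension, which is the intuition behind fusing with gated attention rather than simple concatenation.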