Paper Title

Learning Audio-Visual Embedding for Person Verification in the Wild

Authors

Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu

Abstract

It has been observed that audio-visual embeddings are more robust than uni-modal embeddings for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. We then find that a strong correlation exists between modalities during pooling, and therefore propose joint attentive pooling, which uses cycle consistency to learn implicit inter-frame weights. Finally, the two modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification.
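To make the abstract's building blocks concrete, below is a minimal PyTorch sketch of two of the components it names: attentive statistics pooling over frame-level features, and a gated fusion of the resulting audio and face embeddings. This is an illustration of the generic techniques only, not the authors' implementation; the module names, dimensions, the single-sigmoid gate, and the toy inputs are assumptions, and the paper's weight-enhancement and cycle-consistency terms are omitted.

```python
import torch
import torch.nn as nn


class AttentiveStatisticsPooling(nn.Module):
    """Pool variable-length frame features into a fixed-size embedding
    by concatenating the attention-weighted mean and standard deviation."""

    def __init__(self, feat_dim: int, bottleneck: int = 128):
        super().__init__()
        # Scores one attention weight per frame (assumed architecture).
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        w = torch.softmax(self.attention(x), dim=1)   # (B, T, 1), sums to 1 over frames
        mean = torch.sum(w * x, dim=1)                # weighted mean, (B, D)
        var = torch.sum(w * x ** 2, dim=1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))         # weighted std, numerically safe
        return torch.cat([mean, std], dim=1)          # (B, 2D)


class GatedFusion(nn.Module):
    """Fuse audio and visual embeddings with a learned per-dimension gate
    (a simple stand-in for the paper's gated attention mechanism)."""

    def __init__(self, emb_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * emb_dim, emb_dim),
            nn.Sigmoid(),
        )

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([a, v], dim=1))  # gate values in (0, 1)
        return g * a + (1.0 - g) * v             # convex combination of modalities


# Toy usage: 200 audio frames and 32 face frames per identity (shapes assumed).
audio_frames = torch.randn(4, 200, 256)
face_frames = torch.randn(4, 32, 256)
pool = AttentiveStatisticsPooling(256)
fuse = GatedFusion(512)
audio_emb = pool(audio_frames)          # (4, 512)
face_emb = pool(face_frames)            # (4, 512)
joint_emb = fuse(audio_emb, face_emb)   # (4, 512) audio-visual embedding
```

The gate makes the fusion adaptive: when one modality is degraded (e.g., an occluded face or noisy audio), the learned gate can shift the combined embedding toward the more reliable modality, which is the intuition behind gated fusion for verification in the wild.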
