Paper Title
Learning Audio-Visual embedding for Person Verification in the Wild
Paper Authors
Paper Abstract
It has been observed that audio-visual embeddings are more robust than uni-modal embeddings for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. We then find that a strong correlation exists between the modalities during pooling, and therefore propose joint attentive pooling, which incorporates cycle consistency to learn implicit inter-frame weights. Finally, the two modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification.
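To make the two aggregation ideas in the abstract concrete, the sketch below shows a generic attentive statistics pooling layer over frame-level features and a gated fusion of audio and face embeddings. This is a minimal sketch assuming PyTorch and hypothetical layer sizes; it does not reproduce the paper's weight-enhanced or joint (cycle-consistent) formulations, only the standard building blocks they extend.

```python
# Minimal sketch (not the paper's exact formulation): attentive statistics
# pooling over frame-level features and a gated fusion of two modalities.
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Pool frame-level features (B, T, D) into a single vector using an
    attention-weighted mean and standard deviation."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.attention(x), dim=1)          # (B, T, 1) frame weights
        mean = (w * x).sum(dim=1)                            # weighted mean over frames
        var = (w * (x - mean.unsqueeze(1)) ** 2).sum(dim=1)  # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=-1)                # (B, 2D) statistics vector


class GatedFusion(nn.Module):
    """Fuse audio and face embeddings with a learned sigmoid gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, face: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([audio, face], dim=-1))      # per-dimension gate in (0, 1)
        return g * audio + (1 - g) * face                    # gated audio-visual embedding


if __name__ == "__main__":
    frames_a = torch.randn(4, 50, 256)   # audio frame features (hypothetical sizes)
    frames_v = torch.randn(4, 50, 256)   # face frame features
    pool = AttentiveStatsPooling(256)
    fuse = GatedFusion(512)              # pooled vectors are 2 * 256 = 512-dim
    emb = fuse(pool(frames_a), pool(frames_v))
    print(emb.shape)                      # torch.Size([4, 512])
```

The gate lets the model down-weight a degraded modality (e.g., a noisy audio track or an occluded face) per embedding dimension, which is the intuition behind fusing with gated attention rather than simple concatenation.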