Title
Supervised attention for speaker recognition
Authors
Abstract
The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, where the role of the context vector is to select the most discriminative frames for speaker recognition. However, SAP underperforms the temporal average pooling (TAP) baseline in some settings, which implies that the attention is not learnt effectively in end-to-end training. To tackle this problem, we introduce strategies for training the attention mechanism in a supervised manner, which learn the context vector using classified samples. With our proposed methods, the context vector is better able to select the most informative frames. We show that our method outperforms existing methods in various experimental settings, including short-utterance speaker recognition, and achieves competitive performance against the existing baselines on the VoxCeleb datasets.
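To make the contrast between the two pooling schemes concrete, the following is a minimal numpy sketch of TAP and of a simplified SAP head (the full SAP of the paper learns the context vector jointly with the network and may include a hidden projection layer; here the context vector is just passed in as an argument, and the function names are our own):

```python
import numpy as np

def temporal_average_pooling(frames):
    """TAP: uniform mean over the time axis of a (T, D) frame matrix."""
    return frames.mean(axis=0)

def self_attentive_pooling(frames, context):
    """Simplified SAP: score each frame against a context vector,
    softmax the scores into attention weights, and return the
    attention-weighted mean of the frames."""
    scores = frames @ context            # (T,) similarity per frame
    scores = scores - scores.max()       # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return weights @ frames              # (D,) weighted utterance embedding
```

Note that with an all-zero context vector the softmax weights are uniform, so this SAP sketch reduces exactly to TAP; the supervised training strategies in the paper are aimed at moving the context vector away from such uninformative solutions.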