Paper title
Large-scale learning of generalised representations for speaker recognition
Paper authors
Abstract
The objective of this work is to develop a speaker recognition model that can be used in diverse scenarios. We hypothesise that two components must be adequately configured to build such a model. First, a suitable architecture is required. We explore several recent state-of-the-art models, including ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive amount of data is required. We investigate several new training data configurations that combine a few existing datasets. The most extensive configuration includes 10.22k hours of speech from over 87k speakers. Four evaluation protocols are adopted to measure how the trained model performs in diverse scenarios. Through experiments, we find that MFA-Conformer, which has the least inductive bias, generalises the best. We also show that training with the proposed large data configurations gives better performance. A boost in generalisation is observed, where the average performance across the four evaluation protocols improves by more than 20%. In addition, we demonstrate that these models' performance can be improved even further by increasing capacity.
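For context, speaker recognition models like those described above are typically evaluated by extracting a fixed-dimensional embedding per utterance and scoring enrolment/test pairs with cosine similarity. A minimal sketch of that scoring step (the random vectors and the 192-dimensional size are illustrative assumptions, not outputs of the paper's models; 192 is the embedding size used in common ECAPA-TDNN recipes):

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    emb_a = emb_a / np.linalg.norm(emb_a)
    emb_b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(emb_a, emb_b))

# Hypothetical embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
enrol_emb = rng.normal(size=192)
test_emb = rng.normal(size=192)

score = cosine_score(enrol_emb, test_emb)
# A trial is accepted as "same speaker" if the score exceeds a
# threshold tuned on a development set (e.g. at the equal error rate).
```

Each evaluation protocol is then a list of such trials, and metrics like equal error rate (EER) summarise how well the scores separate same-speaker from different-speaker pairs.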