Paper Title
Pushing the limits of raw waveform speaker recognition
Paper Authors
Paper Abstract
In recent years, speaker recognition systems based on raw waveform inputs have received increasing attention. However, the performance of such systems is typically inferior to that of state-of-the-art counterparts based on handcrafted features, which demonstrate equal error rates under 1% on the popular VoxCeleb1 test set. This paper proposes a novel speaker recognition model based on raw waveform inputs. The model incorporates recent advances in machine learning and speaker verification, including the Res2Net backbone module and multi-layer feature aggregation. Our best model achieves an equal error rate of 0.89%, which is competitive with state-of-the-art models based on handcrafted features and outperforms the best model based on raw waveform inputs by a large margin. We also explore the application of the proposed model within a self-supervised learning framework. Our self-supervised model outperforms existing single-phase works in this line of research. Finally, we show that self-supervised pre-training is effective for the semi-supervised scenario, where we only have a small set of labelled training data along with a larger set of unlabelled examples.
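To make the architectural ideas concrete, the sketch below shows a minimal raw-waveform speaker embedder in PyTorch that combines a Res2Net-style 1-D block with multi-layer feature aggregation and temporal mean pooling. It is an illustration only, not the paper's actual architecture: the class names (Res2NetBlock1d, RawWaveformEmbedder) and all hyper-parameters (scale=4, 128 channels, 192-dimensional embeddings) are assumptions made for clarity.

# Illustrative sketch only: a raw-waveform speaker embedder with a
# Res2Net-style 1-D block and multi-layer feature aggregation (MFA).
# Layer sizes, names and hyper-parameters are assumptions, not the
# paper's actual configuration.
import torch
import torch.nn as nn


class Res2NetBlock1d(nn.Module):
    """Split channels into `scale` groups and apply hierarchical 3x1
    convolutions, feeding each group's output into the next one."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        self.convs = nn.ModuleList(
            [nn.Conv1d(width, width, kernel_size=3, padding=1) for _ in range(scale - 1)]
        )
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, self.scale, dim=1)
        outs, y = [chunks[0]], chunks[0]           # first split passes through
        for i, conv in enumerate(self.convs, start=1):
            inp = chunks[i] if i == 1 else chunks[i] + y
            y = conv(inp)                          # hierarchical residual reuse
            outs.append(y)
        return self.act(self.bn(torch.cat(outs, dim=1)) + x)


class RawWaveformEmbedder(nn.Module):
    """Raw waveform -> learnable conv front-end -> stacked Res2Net-style
    blocks; outputs of all blocks are concatenated (multi-layer feature
    aggregation) and mean-pooled into a fixed-size speaker embedding."""

    def __init__(self, channels: int = 128, n_blocks: int = 3, embed_dim: int = 192):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=251, stride=160, padding=125),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )
        self.blocks = nn.ModuleList([Res2NetBlock1d(channels) for _ in range(n_blocks)])
        self.embed = nn.Linear(channels * n_blocks, embed_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        x = self.frontend(wav.unsqueeze(1))        # (batch, channels, frames)
        taps = []
        for block in self.blocks:
            x = block(x)
            taps.append(x)                         # keep every block's output
        agg = torch.cat(taps, dim=1)               # aggregate across layers
        return self.embed(agg.mean(dim=2))         # temporal mean pooling


if __name__ == "__main__":
    model = RawWaveformEmbedder()
    emb = model(torch.randn(2, 32000))             # two 2-second clips at 16 kHz
    print(emb.shape)                               # torch.Size([2, 192])

In practice, an embedder of this kind would be trained with a speaker-classification or contrastive objective, and verification trials would be scored by cosine similarity between embeddings.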