Paper title
Disentangled Speaker Representation Learning via Mutual Information Minimization
Paper authors
Paper abstract
The domain mismatch problem caused by speaker-unrelated features has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework to unravel speaker-related features from speaker-unrelated features via mutual information (MI) minimization. To achieve our goal of minimizing the MI between speaker-related and speaker-unrelated features, we adopt the contrastive log-ratio upper bound (CLUB), which exploits an upper bound on MI. Our framework is constructed in a three-stage structure. First, in the front-end encoder, input speech is encoded into a shared initial embedding. Next, in the decoupling block, the shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is conducted by MI minimization in the last stage. Experiments on the Far-Field Speaker Verification Challenge 2022 (FFSVC 2022) demonstrate that our proposed framework is effective for disentanglement. Also, to utilize domain-unknown datasets containing numerous speakers, we pre-trained the front-end encoder with the VoxCeleb datasets. We then fine-tuned the speaker embedding model in the disentanglement framework with the FFSVC 2022 dataset. The experimental results show that fine-tuning an existing pre-trained model within the disentanglement framework is valid and can further improve performance.
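To make the CLUB objective concrete, the following is a minimal NumPy sketch of the sample-based upper-bound estimate. It assumes a hypothetical diagonal-Gaussian variational approximation q(y|x) = N(Wx, exp(log_var)) with a linear mean map; in the paper the parameters of q would be learned jointly with the encoder, and the quantity below would be minimized with respect to the embeddings.

```python
import numpy as np

def club_upper_bound(x, y, W, log_var):
    """Sample-based CLUB estimate of I(x; y).

    x : (n, dx) speaker-related embeddings
    y : (n, dy) speaker-unrelated embeddings (paired with x)
    W, log_var parameterize a hypothetical variational
    approximation q(y|x) = N(x @ W, exp(log_var)); these are
    illustrative stand-ins for the learned network in the paper.
    """
    mu = x @ W                    # (n, dy): conditional means
    var = np.exp(log_var)         # (dy,): conditional variances
    # log q(y_j | x_i) for all (i, j) pairs, up to an additive constant
    diff = y[None, :, :] - mu[:, None, :]                    # (n, n, dy)
    logq = -0.5 * np.sum(diff ** 2 / var + log_var, axis=-1)  # (n, n)
    positive = np.mean(np.diag(logq))  # matched pairs (x_i, y_i)
    negative = np.mean(logq)           # all pairs (x_i, y_j)
    # CLUB: E[log q(y|x)] over the joint minus over the marginals
    return positive - negative
```

When the two embeddings are dependent (here, y reconstructable from x), the estimate is positive; driving it toward zero during training is what encourages the decoupled embeddings to become independent.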