Title

Target Speech Extraction Based on Blind Source Separation and X-vector-based Speaker Selection Trained with Data Augmentation

Authors

Zhaoyi Gu, Lele Liao, Kai Chen, Jing Lu

Abstract

Extracting the desired speech from a mixture is a meaningful and challenging task. End-to-end DNN-based methods, though attractive, face the problem of generalization. In this paper, we explore a sequential approach to target speech extraction that combines blind source separation (BSS) with an x-vector-based speaker recognition (SR) module. Two promising BSS methods based on the source independence assumption, independent low-rank matrix analysis (ILRMA) and the multi-channel variational autoencoder (MVAE), are utilized and compared. ILRMA employs nonnegative matrix factorization (NMF) to capture the spectral structures of source signals, while MVAE utilizes the strong modeling power of deep neural networks (DNNs). However, the investigation of MVAE has been limited to training with very few speakers, and the speech signals of the test speakers are usually included in the training data. We extend the training of MVAE using clean speech signals from 500 speakers to evaluate its generalization to unseen speakers. To improve the correct extraction rate, two data augmentation strategies are implemented to train the SR module. The performance of the proposed cascaded approach is investigated with test data constructed from real room impulse responses under varied environments.
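
The speaker-selection stage of the cascaded approach can be illustrated with a minimal sketch. It assumes the BSS stage has already produced several separated signals and that some x-vector extractor (not shown here) has mapped each of them, and an enrollment utterance of the target speaker, to a fixed-length embedding. The abstract does not state which scoring back end is used, so cosine similarity is an assumption made for illustration only, and the names `cosine_similarity` and `select_target` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Small constant guards against division by zero for all-zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def select_target(separated_embeddings, enrollment_embedding):
    """Return the index of the separated output that best matches the target.

    separated_embeddings: one embedding per BSS output (assumed precomputed).
    enrollment_embedding: embedding of the target speaker's enrollment speech.
    Scoring by cosine similarity is an illustrative assumption, not
    necessarily the scoring used in the paper.
    """
    scores = [cosine_similarity(e, enrollment_embedding)
              for e in separated_embeddings]
    return int(np.argmax(scores)), scores


# Toy usage: two separated outputs with 2-D stand-in "embeddings".
outputs = [[1.0, 0.0], [0.0, 1.0]]
target = [0.1, 0.9]
best_index, scores = select_target(outputs, target)
```

In this toy case the second output is selected because its embedding points in nearly the same direction as the enrollment embedding. A wrong selection at this stage yields the wrong speaker even when separation is perfect, which is why the abstract reports the correct extraction rate as a quantity to improve.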
