Title
Y-Vector: Multiscale Waveform Encoder for Speaker Embedding
Authors
Abstract
State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies as speech features. Recent studies have attempted to extract speaker embeddings directly from raw waveforms and have shown competitive results. In this paper, we propose a novel multi-scale waveform encoder that uses three convolutional branches with different time scales to compute speech features from the waveform. These features are then processed by squeeze-and-excitation blocks, a multi-level feature aggregator, and a time-delay neural network (TDNN) to compute the speaker embedding. We show that the proposed embeddings outperform existing raw-waveform-based speaker embeddings on speaker verification by a large margin. A further analysis of the learned filters shows that the multi-scale encoder attends to different frequency bands at its different scales, while producing a flatter overall frequency response than any of the single-scale counterparts.
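To make the multi-scale idea concrete, the following is a minimal numpy sketch of an encoder with three convolutional branches of different kernel lengths (time scales) over a raw waveform. The kernel sizes, stride, and random filters are illustrative assumptions, not the paper's actual hyperparameters; the SE blocks, feature aggregator, and TDNN stages are omitted.

```python
import numpy as np

def conv1d(x, kernel, stride):
    """Valid 1-D cross-correlation with a given stride."""
    out_len = (len(x) - len(kernel)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(kernel)], kernel)
                     for i in range(out_len)])

def multiscale_encode(wave, rng):
    """Run three branches with different receptive fields over the waveform.
    Kernel sizes (31, 63, 127) and stride 16 are hypothetical choices."""
    feats = []
    for k in (31, 63, 127):                      # short, medium, long time scales
        kernel = rng.standard_normal(k) / np.sqrt(k)   # random (untrained) filter
        feats.append(conv1d(wave, kernel, stride=16))
    # Crop every branch to the shortest output so frames align in time,
    # then stack the branches as channels.
    n = min(map(len, feats))
    return np.stack([f[:n] for f in feats])      # shape: (3, n_frames)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)                # 1 s of audio at 16 kHz
feats = multiscale_encode(wave, rng)
print(feats.shape)                               # (3, 993)
```

In a trained model the filters would be learned end-to-end, which is what allows the analysis in the paper to inspect which frequency bands each scale attends to.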