Paper Title


On the Use of Audio Fingerprinting Features for Speech Enhancement with Generative Adversarial Network

Paper Authors

Farnood Faraji, Yazid Attabi, Benoit Champagne, Wei-Ping Zhu

Paper Abstract


The advent of learning-based methods in speech enhancement has revived the need for robust and reliable training features that can compactly represent speech signals while preserving their vital information. Time-frequency domain features, such as the Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC), are preferred in many approaches. While the MFCC provide a compact representation, they ignore the dynamics and distribution of energy in each mel-scale subband. In this work, a speech enhancement system based on a Generative Adversarial Network (GAN) is implemented and tested with a combination of Audio FingerPrinting (AFP) features obtained from the MFCC and the Normalized Spectral Subband Centroids (NSSC). The NSSC capture the locations of speech formants and complement the MFCC in a crucial way. In experiments with diverse speakers and noise types, GAN-based speech enhancement with the proposed AFP feature combination achieves the best objective performance while reducing memory requirements and training time.
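To make the feature combination concrete, below is a minimal sketch (not the authors' implementation) of how MFCC and NSSC frames could be computed and stacked into the AFP-style input described in the abstract. NSSC are taken here as the energy-weighted mean frequency within each mel subband, normalized by the Nyquist frequency; the librosa-based pipeline and all parameter values (n_fft, hop, n_mels, n_mfcc, the choice of normalization) are illustrative assumptions rather than the paper's settings.

import numpy as np
import librosa

def afp_features(y, sr, n_fft=512, hop=128, n_mels=32, n_mfcc=13):
    # Power spectrogram of the input speech signal.
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2

    # Mel filterbank and mel power spectrogram -> MFCC.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_S = mel_fb @ S
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_S), n_mfcc=n_mfcc)

    # Subband centroids: energy-weighted mean frequency inside each mel band,
    # which tracks where the spectral energy (e.g. a formant) sits in that band.
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)   # FFT bin frequencies
    num = mel_fb @ (freqs[:, None] * S)
    den = mel_fb @ S + 1e-10
    centroids = num / den                                  # (n_mels, n_frames)

    # Normalize centroids to [0, 1] by the Nyquist frequency (an assumption;
    # other normalizations, e.g. by the subband edge frequencies, are possible).
    nssc = centroids / (sr / 2.0)

    # Stack MFCC and NSSC along the feature axis to form the combined input.
    return np.vstack([mfcc, nssc])                         # (n_mfcc + n_mels, n_frames)

In a GAN-based enhancement setup, frames of this stacked representation would serve as the network input in place of full-resolution STFT features; the lower feature dimensionality is consistent with the reduced memory requirements and training time reported in the abstract.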
