Paper Title
STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency
Paper Authors
Paper Abstract
Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window can lead to higher frequency resolution and potentially better enhancement. This, however, incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed using the same window size. To reduce this inherent latency, we adapt a conventional dual-window-size approach, where a regular input window size is used for STFT but a shorter output window is used for overlap-add, to STFT-domain deep learning based frame-online speech enhancement. Based on this STFT-iSTFT configuration, we employ complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the DNN-predicted RI components to conduct frame-online beamforming, the results of which are used as extra features for a second DNN to perform frame-online post-filtering. The frequency-domain beamformer can be easily integrated with our DNNs and is designed not to incur any algorithmic latency. Additionally, we propose a future-frame prediction technique to further reduce the algorithmic latency. Evaluation on noisy-reverberant speech enhancement shows the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system can achieve better enhancement performance for a comparable amount of computation, or comparable performance with less computation, maintaining strong performance at an algorithmic latency as low as 2 ms.
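The dual-window-size idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 16 kHz sampling rate, the 4 ms output window with a 2 ms hop, the particular asymmetric analysis window, and the function names `dual_window_roundtrip` and `algorithmic_latency_ms` are all assumptions made for illustration. The sketch analyzes with a 32 ms window, overlap-adds only the last 4 ms of each inverse-transformed frame, and picks the synthesis window so that an unmodified spectrum reconstructs the input exactly.

```python
import numpy as np


def algorithmic_latency_ms(n_out, sample_rate):
    """Latency set by the overlap-add (output) window length, in milliseconds."""
    return 1000.0 * n_out / sample_rate


def dual_window_roundtrip(x, n_in=512, n_out=64):
    """STFT with a long analysis window, overlap-add with a short synthesis window."""
    hop = n_out // 2
    # Asymmetric analysis window (illustrative choice): long rise, short fall.
    rise = np.sin(0.5 * np.pi * (np.arange(n_in - hop) + 0.5) / (n_in - hop))
    fall = np.cos(0.5 * np.pi * (np.arange(hop) + 0.5) / hop)
    g = np.concatenate([rise, fall])
    # A periodic Hann window of length n_out sums to 1 at 50% overlap.
    w = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_out) / n_out)
    # The synthesis window undoes the analysis window on the last n_out samples.
    s = w / g[-n_out:]

    n = len(x)
    pad_tail = (-n) % hop  # round the signal length up to a multiple of hop
    xp = np.concatenate([np.zeros(n_in - hop), x, np.zeros(pad_tail + hop)])
    y = np.zeros_like(xp)
    n_frames = (len(xp) - n_in) // hop + 1
    for t in range(n_frames):
        start = t * hop
        # An enhancement model would modify `spec` here; this sketch leaves it as is.
        spec = np.fft.rfft(xp[start:start + n_in] * g)
        frame = np.fft.irfft(spec, n=n_in)
        # Only the most recent n_out samples of each frame are overlap-added.
        y[start + n_in - n_out:start + n_in] += frame[-n_out:] * s
    return y[n_in - hop:n_in - hop + n]
```

Under these assumptions, `algorithmic_latency_ms(64, 16000)` gives 4 ms, while setting the output window equal to the 512-sample input window gives the 32 ms figure quoted in the abstract. The future-frame prediction that the abstract credits with reaching 2 ms is not modeled here.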