STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency

Paper Title

STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency

Authors

Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe, Jonathan Le Roux

Abstract

Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window can lead to higher frequency resolution and potentially better enhancement. This however incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed using the same window size. To reduce this inherent latency, we adapt a conventional dual-window-size approach, where a regular input window size is used for STFT but a shorter output window is used for overlap-add, for STFT-domain deep learning based frame-online speech enhancement. Based on this STFT-iSTFT configuration, we employ complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the DNN-predicted RI components to conduct frame-online beamforming, the results of which are used as extra features for a second DNN to perform frame-online post-filtering. The frequency-domain beamformer can be easily integrated with our DNNs and is designed to not incur any algorithmic latency. Additionally, we propose a future-frame prediction technique to further reduce the algorithmic latency. Evaluation on noisy-reverberant speech enhancement shows the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system can achieve better enhancement performance for a comparable amount of computation, or comparable performance with less computation, maintaining strong performance at an algorithmic latency as low as 2 ms.
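
The dual-window-size idea in the abstract can be illustrated with a small NumPy sketch: each frame is analyzed with a long input window, but only the last few samples of the resynthesized frame are overlap-added with a short output window. Everything specific here is an assumption made for illustration and is not taken from the paper: the 16 kHz sampling rate, the 4 ms output window and 2 ms hop, the square-root Hann analysis window, the way the synthesis window is derived, and the function name `dual_window_stft_istft`. The DNN is replaced by an identity pass-through so that reconstruction can be checked.

```python
import numpy as np

def dual_window_stft_istft(x, in_win=512, out_win=64, hop=32):
    """Analyze with a long window, resynthesize with a short one (illustrative)."""
    assert out_win == 2 * hop and out_win <= in_win
    # Analysis window: square-root periodic Hann over the full input frame.
    analysis = np.sin(np.pi * np.arange(in_win) / in_win)
    # Periodic Hann of length out_win; copies shifted by hop = out_win/2 sum to 1.
    cola = np.sin(np.pi * np.arange(out_win) / out_win) ** 2
    # Synthesis window undoes the analysis window on the last out_win samples.
    synthesis = cola / analysis[-out_win:]

    n_frames = 1 + (len(x) - in_win) // hop
    y = np.zeros(len(x))
    for t in range(n_frames):
        start = t * hop
        spec = np.fft.rfft(x[start:start + in_win] * analysis)  # one STFT frame
        # A DNN would modify `spec` here; this sketch passes it through unchanged.
        frame = np.fft.irfft(spec, n=in_win)
        tail = frame[-out_win:] * synthesis
        y[start + in_win - out_win:start + in_win] += tail  # short overlap-add
    # Samples covered by two overlapping output windows are fully reconstructed.
    valid = slice(in_win - hop, (n_frames - 1) * hop + in_win - hop)
    return y, valid

sr = 16000                                  # assumed sampling rate
rng = np.random.default_rng(0)
x = rng.standard_normal(sr // 4)            # 250 ms of noise as a test signal
y, valid = dual_window_stft_istft(x)
max_err = float(np.max(np.abs(y[valid] - x[valid])))
# Assumption: latency is set by the window used in overlap-add (64 samples here).
latency_ms = 1000 * 64 / sr
```

With these assumed settings, the 32 ms input window still provides the frequency resolution, while an output sample only has to wait for the frames whose short output windows cover it. This is consistent with the abstract's statement that the overlap-add window determines the latency, although the exact window design used by the authors may differ.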
