FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement

Paper title

FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement

Authors

Shengkui Zhao, Bin Ma, Karn N. Watcharasupat, Woon-Seng Gan

Abstract

Convolutional recurrent networks (CRN), which integrate a convolutional encoder-decoder (CED) structure with a recurrent structure, have achieved promising performance for monaural speech enhancement. However, feature representation across the frequency context is highly constrained by the limited receptive fields of the convolutions in the CED. In this paper, we propose a convolutional recurrent encoder-decoder (CRED) structure to boost feature representation along the frequency axis. The CRED applies frequency recurrence to the 3D convolutional feature maps along the frequency axis after each convolution, so it can capture long-range frequency correlations and enhance the feature representations of speech inputs. The proposed frequency recurrence is realized efficiently using a feedforward sequential memory network (FSMN). Besides the CRED, we insert two stacked FSMN layers between the encoder and the decoder to further model temporal dynamics. We name the proposed framework Frequency Recurrent CRN (FRCRN). We design FRCRN to predict the complex Ideal Ratio Mask (cIRM) in the complex-valued domain and optimize it using both time-frequency-domain and time-domain losses. Our proposed approach achieved state-of-the-art performance on wideband benchmark datasets and took 2nd place in the real-time fullband track, in terms of Mean Opinion Score (MOS) and Word Accuracy (WAcc), in the ICASSP 2022 Deep Noise Suppression (DNS) challenge (https://github.com/modelscope/ClearerVoice-Studio).
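
The two mechanisms named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the FRCRN implementation. The first function shows an FSMN-style memory block sliding along the frequency axis instead of the time axis, simplified to a skip connection plus per-channel weighted neighbouring bins; the learned projections of a real FSMN layer are omitted, and the `taps` parameterisation and all function names are assumptions made for illustration. The second and third functions use the standard cIRM definition, the complex ratio of the clean bin S to the noisy bin Y, and apply the mask by complex multiplication.

```python
# Illustrative sketch only; names and parameter layout are hypothetical.

def fsmn_memory_along_frequency(frames, taps):
    """Apply a simplified FSMN-style memory block along the frequency axis.

    frames: per-frequency-bin feature vectors for one time step
            (list of lists of floats, shape [n_bins][n_channels]).
    taps:   dict mapping a frequency offset (e.g. -1 or +1) to a
            per-channel weight vector (assumed parameterisation).
    Returns a list of the same shape: each bin's own features plus a
    weighted sum of the features of the bins at the given offsets.
    """
    n_bins = len(frames)
    out = []
    for f in range(n_bins):
        mem = list(frames[f])  # skip connection keeps the bin's own features
        for offset, weights in taps.items():
            g = f + offset
            if 0 <= g < n_bins:  # bins outside the spectrum contribute nothing
                for c, w in enumerate(weights):
                    mem[c] += w * frames[g][c]
        out.append(mem)
    return out


def cirm(clean, noisy, eps=1e-8):
    """Complex ideal ratio mask for one time-frequency bin, M = S / Y."""
    denom = noisy.real ** 2 + noisy.imag ** 2 + eps  # |Y|^2, eps avoids 0/0
    real = (noisy.real * clean.real + noisy.imag * clean.imag) / denom
    imag = (noisy.real * clean.imag - noisy.imag * clean.real) / denom
    return complex(real, imag)


def apply_mask(mask, noisy):
    """Estimate the clean bin by complex multiplication of mask and noisy bin."""
    return mask * noisy


# Three frequency bins with two channels each; one tap below, one above.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
taps = {-1: [0.5, 0.5], 1: [0.25, 0.25]}
mixed = fsmn_memory_along_frequency(frames, taps)

# With the ideal mask, masking the noisy bin recovers the clean bin.
clean_bin, noisy_bin = 1 + 2j, 3 - 1j
estimate = apply_mask(cirm(clean_bin, noisy_bin), noisy_bin)
```

Because each output bin mixes in features from other bins, stacking such blocks after every convolution widens the frequency context beyond the convolution kernel, which is the effect the abstract attributes to frequency recurrence. In a trained network the mask is predicted rather than computed from the clean signal.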

