Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

Paper Title

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

Authors

Haoyu Li, Yang Ai, Junichi Yamagishi

Abstract

High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neural network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out the channel characteristics from the original input audio using the encoder network with adversarial training. Next, we disentangle the channel factor from a reference audio. Conditioned on this factor, an auto-regressive decoder is then used to predict the target-environment Mel spectrogram. Finally, we apply a neural vocoder to synthesize the speech waveform. Experimental results show that the proposed system can generate a professional high-quality speech waveform when setting high-quality audio as the reference. It also improves speech enhancement performance compared with several state-of-the-art baseline systems.
