HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Paper Title

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Authors

Chunhui Wang, Chang Zeng, Jun Chen, Xing He

Abstract

Entertainment-oriented singing voice synthesis (SVS) requires a vocoder that generates high-fidelity audio (e.g., 48 kHz). However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize high-quality 48 kHz singing voices in real time. Specifically, it consists of an Extended WaveNet serving as the generator, the multi-period discriminator proposed in HiFiGAN, and the multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part from the full-band mel-spectrogram, we incorporate a pulse extractor that generates a constraint for the synthesized waveform. Additionally, an auxiliary spectrogram-phase loss is used to further approximate the real distribution. The experimental results show that the proposed HiFi-WaveGAN obtains a mean opinion score (MOS) of 4.23 on the 48 kHz SVS task, significantly outperforming other neural vocoders.
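The abstract names an auxiliary spectrogram-phase loss but does not define it. As a rough illustration only, the sketch below shows one plausible form of such a loss: an L1 distance between log-magnitude spectrograms plus a wrapped phase term. The function names (`stft`, `spectrogram_phase_loss`), the frame and hop sizes, the `phase_weight` parameter, and the `1 - cos` phase term are all assumptions made for this example, not the formulation used in the paper.

```python
import numpy as np


def stft(signal, frame_len=1024, hop=256):
    """Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def spectrogram_phase_loss(real, fake, frame_len=1024, hop=256, phase_weight=1.0):
    """Hypothetical auxiliary loss: log-magnitude L1 plus a wrapped phase term.

    This is an assumed form for illustration, not the paper's definition.
    """
    s_real = stft(real, frame_len, hop)
    s_fake = stft(fake, frame_len, hop)
    eps = 1e-7  # avoids log(0) in silent bins
    mag_loss = np.mean(
        np.abs(np.log(np.abs(s_real) + eps) - np.log(np.abs(s_fake) + eps))
    )
    # 1 - cos(delta) is zero when phases match and ignores 2*pi wrap-around.
    phase_diff = np.angle(s_real) - np.angle(s_fake)
    phase_loss = np.mean(1.0 - np.cos(phase_diff))
    return mag_loss + phase_weight * phase_loss


sr = 48_000                                   # sample rate from the abstract
t = np.arange(sr // 10) / sr                  # 100 ms of audio
target = np.sin(2 * np.pi * 440.0 * t)
shifted = np.sin(2 * np.pi * 440.0 * t + 0.5)  # same tone, shifted phase

loss_same = spectrogram_phase_loss(target, target)
loss_shifted = spectrogram_phase_loss(target, shifted)
```

In this toy setup, a magnitude-only spectrogram loss would barely distinguish `shifted` from `target`, whereas the phase term penalizes the offset. That is the general motivation for pairing the two terms; the paper itself should be consulted for the actual loss.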
