Paper title
HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation
Paper authors
Paper abstract
Entertainment-oriented singing voice synthesis (SVS) requires a vocoder that generates high-fidelity audio (e.g., 48 kHz). However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize high-quality 48 kHz singing voices in real time. Specifically, it consists of an Extended WaveNet serving as the generator, the multi-period discriminator proposed in HiFiGAN, and the multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part from the full-band mel-spectrogram, we incorporate a pulse extractor that generates a constraint for the synthesized waveform. Additionally, an auxiliary spectrogram-phase loss is used to further approximate the real distribution. Experimental results show that the proposed HiFi-WaveGAN obtains a mean opinion score (MOS) of 4.23 on the 48 kHz SVS task, significantly outperforming other neural vocoders.
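The abstract names an auxiliary spectrogram-phase loss but does not give its formula. As a rough illustration of the general idea only, the sketch below compares two waveforms by the L1 distance between their STFT magnitudes plus a wrap-insensitive phase distance. Everything here is an assumption rather than the paper's definition: the function names `stft` and `spectrogram_phase_loss`, the frame and hop sizes, the `phase_weight` parameter, and the `1 - cos` phase term are all hypothetical choices. It uses a naive DFT from the standard library so that it stays self-contained; a real training setup would likely rely on a framework's differentiable STFT instead.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=32):
    """Naive Hann-windowed STFT: a list of frames, each a list of complex bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Keep only the non-negative frequencies of the real-valued input.
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def spectrogram_phase_loss(real, fake, frame_len=64, hop=32, phase_weight=1.0):
    """Hypothetical loss: mean |mag diff| + phase_weight * mean (1 - cos(phase diff))."""
    real_spec = stft(real, frame_len, hop)
    fake_spec = stft(fake, frame_len, hop)
    mag_err, phase_err, count = 0.0, 0.0, 0
    for real_frame, fake_frame in zip(real_spec, fake_spec):
        for r, f in zip(real_frame, fake_frame):
            mag_err += abs(abs(r) - abs(f))
            # 1 - cos(delta) is unaffected by 2*pi wrapping of the phase.
            # Note: low-energy bins have noisy phase; this sketch does not
            # down-weight them, which a practical loss might want to do.
            phase_err += 1.0 - math.cos(cmath.phase(r) - cmath.phase(f))
            count += 1
    if count == 0:
        raise ValueError("signals must be at least frame_len samples long")
    return (mag_err + phase_weight * phase_err) / count


# Usage: a 3 kHz tone at the 48 kHz rate mentioned in the abstract,
# compared with itself and with a quarter-cycle phase-shifted copy.
sample_rate = 48000
real = [math.sin(2 * math.pi * 3000 * n / sample_rate) for n in range(256)]
shifted = [math.sin(2 * math.pi * 3000 * n / sample_rate + math.pi / 2)
           for n in range(256)]
print(spectrogram_phase_loss(real, real))
print(spectrogram_phase_loss(real, shifted))
```

The shifted tone has nearly the same magnitude spectrogram as the original, so a magnitude-only loss would barely distinguish the two. The phase term is what separates them, which is one plausible motivation for adding phase information to a spectrogram loss.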