Paper Title
Audio Time-Scale Modification with Temporal Compressing Networks
Paper Authors
Paper Abstract
We propose a novel approach for time-scale modification of audio signals. Unlike traditional methods that rely on the framing technique or the short-time Fourier transform to preserve frequency during temporal stretching, our neural network model encodes the raw audio into a high-level latent representation, dubbed Neuralgram, in which each vector represents 1024 audio sample points. Because the compression ratio is sufficiently high, we are able to apply arbitrary spatial interpolation to the Neuralgram to perform temporal stretching. Finally, a learned neural decoder synthesizes the time-scaled audio samples from the stretched Neuralgram representation. Both the encoder and the decoder are trained with latent regression losses and adversarial losses in order to obtain high-fidelity audio samples. Despite its simplicity, our method performs comparably to existing baselines and opens new possibilities for research on modern time-scale modification. Audio samples can be found at https://tsmnet-mmasia23.github.io
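The stretching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoder and decoder are omitted, the latent sequence is a random stand-in, and linear interpolation is assumed only as one possible choice of interpolation. The function name `stretch_latent`, the `rate` parameter, and the 64-channel latent width are hypothetical; only the figure of 1024 samples per latent vector comes from the abstract.

```python
import numpy as np

HOP = 1024  # audio samples represented by one latent vector (from the abstract)


def stretch_latent(latent: np.ndarray, rate: float) -> np.ndarray:
    """Linearly resample a (frames, channels) latent sequence along time.

    rate > 1 shortens the sequence (faster playback after decoding);
    rate < 1 lengthens it (slower playback). Hypothetical helper.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_in, n_ch = latent.shape
    n_out = max(1, int(round(n_in / rate)))
    # Positions of the output frames, expressed on the input frame axis.
    src = np.linspace(0.0, n_in - 1, num=n_out)
    grid = np.arange(n_in)
    # np.interp is 1-D, so interpolate each latent channel separately.
    return np.stack(
        [np.interp(src, grid, latent[:, c]) for c in range(n_ch)], axis=1
    )


# Stand-in for an encoder output: 100 frames of a 64-channel latent (assumed width).
rng = np.random.default_rng(0)
latent = rng.standard_normal((100, 64))

slow = stretch_latent(latent, rate=0.5)
print(slow.shape)           # (200, 64)
print(slow.shape[0] * HOP)  # 204800 samples a decoder would be asked to produce
```

Because each latent vector stands for 1024 samples, resampling the latent sequence changes the duration of the decoded audio in proportion, while pitch preservation is left to the learned decoder.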