End-To-End Dilated Variational Autoencoder with Bottleneck Discriminative Loss for Sound Morphing -- A Preliminary Study

Title

End-To-End Dilated Variational Autoencoder with Bottleneck Discriminative Loss for Sound Morphing -- A Preliminary Study

Authors

Matteo Lionello, Hendrik Purwins

Abstract

We present a preliminary study on an end-to-end variational autoencoder (VAE) for sound morphing. Two VAE variants are compared: VAE with dilation layers (DC-VAE) and VAE only with regular convolutional layers (CC-VAE). We combine the following loss functions: 1) the time-domain mean-squared error for reconstructing the input signal, 2) the Kullback-Leibler divergence to the standard normal distribution in the bottleneck layer, and 3) the classification loss calculated from the bottleneck representation. On a database of spoken digits, we use 1-nearest neighbor classification to show that the sound classes separate in the bottleneck layer. We introduce the Mel-frequency cepstrum coefficient dynamic time warping (MFCC-DTW) deviation as a measure of how well the VAE decoder projects the class center in the latent (bottleneck) layer to the center of the sounds of that class in the audio domain. In terms of MFCC-DTW deviation and 1-NN classification, DC-VAE outperforms CC-VAE. These results for our parametrization and our dataset indicate that DC-VAE is more suitable for sound morphing than CC-VAE, since the DC-VAE decoder better preserves the topology when mapping from the audio domain to the latent space. Examples are given both for morphing spoken digits and drum sounds.
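The three loss terms listed in the abstract can be combined roughly as in the following sketch. It is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the terms are weighted or which classification loss is used, so the weights `beta` and `gamma`, the choice of softmax cross-entropy, and all function names here are hypothetical.

```python
import math

def mse(x, x_hat):
    """Time-domain mean-squared error between a signal and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (assumed form of the classification loss)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[label]

def combined_loss(x, x_hat, mu, log_var, logits, label, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; beta and gamma are hypothetical weights."""
    return (mse(x, x_hat)
            + beta * kl_to_standard_normal(mu, log_var)
            + gamma * cross_entropy(logits, label))
```

In this sketch, `mu` and `log_var` stand for the mean and log-variance produced by the encoder at the bottleneck, and `logits` for the output of a classifier head attached to the bottleneck representation. A perfect reconstruction with a standard-normal posterior leaves only the classification term.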
