Paper Title

Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

Paper Authors

Xie, Yuying, Arildsen, Thomas, Tan, Zheng-Hua

Paper Abstract

Leveraging the fact that speaker identity and content vary on different time scales, the factorized hierarchical variational autoencoder (FHVAE) uses different latent variables to represent these two attributes. Disentanglement of the attributes is achieved by assigning different priors to the corresponding latent variables. For the prior of the speaker identity variable, the FHVAE assumes a Gaussian distribution with an utterance-scale varying mean and a fixed variance. By setting a small fixed variance, the training process encourages the identity variables within one utterance to gather close to the mean of their prior. However, this constraint is relatively weak, as the mean of the prior changes between utterances. Therefore, we introduce contrastive learning into the FHVAE framework, so that speaker identity variables cluster together when representing the same speaker while distancing themselves as far as possible from those of other speakers. Only the training process is changed in this work, not the model structure, so no additional cost is incurred during testing. Voice conversion is chosen as the application in this paper. Latent variable evaluations include speaker verification and identification for the speaker identity variable, and speech recognition for the content variable. Furthermore, voice conversion performance is assessed through fake speech detection experiments. Results show that the proposed method improves both speaker identity and content feature extraction compared to the FHVAE, and achieves better conversion performance than the baseline.
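
The abstract does not specify the exact form of the contrastive term, only that it pulls together speaker identity latents of the same speaker and pushes apart those of different speakers. The following is a minimal sketch of how such a term could be added to the training objective, assuming a supervised, NT-Xent-style contrastive loss applied to the sequence-level (speaker) latent, here called `z2`; the function name, the `alpha` weight, and the temperature value are hypothetical, not taken from the paper.

```python
# Minimal sketch (not the authors' code): a supervised contrastive loss on
# per-segment speaker identity latents z2, with speaker indices as labels.
import torch
import torch.nn.functional as F

def speaker_contrastive_loss(z2: torch.Tensor, speaker_ids: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """z2: (batch, dim) speaker identity latents; speaker_ids: (batch,) integer labels."""
    z = F.normalize(z2, dim=-1)                 # compare in cosine-similarity space
    sim = z @ z.t() / temperature               # pairwise scaled similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)            # exclude self-comparisons
    # Positives: other segments in the batch from the same speaker.
    pos_mask = (speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over positives, for anchors that have at least one.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()

# Hypothetical combined objective: FHVAE loss (negative ELBO) plus the contrastive term.
# total_loss = -elbo + alpha * speaker_contrastive_loss(z2, speaker_ids)
```

Because the term only augments the training loss, the encoder and decoder are unchanged, which matches the paper's claim that no additional cost is incurred at test time.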
