TorchDIVA: An Extensible Computational Model of Speech Production built on an Open-Source Machine Learning Library

Paper Title

TorchDIVA: An Extensible Computational Model of Speech Production built on an Open-Source Machine Learning Library

Authors

Sean Kinahan, Julie Liss, Visar Berisha

Abstract

The DIVA model is a computational model of speech motor control that combines a simulation of the brain regions responsible for speech production with a model of the human vocal tract. The model is currently implemented in Matlab Simulink; however, this is less than ideal as most of the development in speech technology research is done in Python. This means there is a wealth of machine learning tools which are freely available in the Python ecosystem that cannot be easily integrated with DIVA. We present TorchDIVA, a full rebuild of DIVA in Python using PyTorch tensors. DIVA source code was directly translated from Matlab to Python, and built-in Simulink signal blocks were implemented from scratch. After implementation, the accuracy of each module was evaluated via systematic block-by-block validation. The TorchDIVA model is shown to produce outputs that closely match those of the original DIVA model, with a negligible difference between the two. We additionally present an example of the extensibility of TorchDIVA as a research platform. Speech quality enhancement in TorchDIVA is achieved through an integration with an existing PyTorch generative vocoder called DiffWave. A modified DiffWave mel-spectrum upsampler was trained on human speech waveforms and conditioned on the TorchDIVA speech production. The results indicate improved speech quality metrics in the DiffWave-enhanced output as compared to the baseline. This enhancement would have been difficult or impossible to accomplish in the original Matlab implementation. This proof-of-concept demonstrates the value TorchDIVA will bring to the research community. Researchers can download the new implementation at: https://github.com/skinahan/DIVA_PyTorch
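The block-by-block validation mentioned in the abstract can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' test harness: the function names (`max_abs_diff`, `validate_blocks`), the two toy blocks, the recorded input/output pairs, and the tolerance are all hypothetical, and plain Python lists stand in for the PyTorch tensors that TorchDIVA uses so that the example has no dependencies.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Signal = List[float]
Block = Callable[[Sequence[float]], Signal]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def validate_blocks(
    ported: Dict[str, Block],
    reference: Dict[str, Tuple[Signal, Signal]],
    tol: float = 1e-6,
) -> Dict[str, Tuple[float, bool]]:
    """Run each ported block on its recorded input and compare with the recorded output."""
    report = {}
    for name, (recorded_in, recorded_out) in reference.items():
        err = max_abs_diff(ported[name](recorded_in), recorded_out)
        report[name] = (err, err <= tol)
    return report


# Hypothetical blocks standing in for re-implemented Simulink signal blocks.
def gain(x: Sequence[float], k: float = 2.0) -> Signal:
    return [k * v for v in x]


def unit_delay(x: Sequence[float]) -> Signal:
    # Output is the input shifted by one sample, with a zero initial state.
    return [0.0] + list(x[:-1])


ported_blocks = {"gain": gain, "unit_delay": unit_delay}

# Hypothetical (input, output) pairs, as if exported from the reference model.
reference_io = {
    "gain": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "unit_delay": ([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]),
}

report = validate_blocks(ported_blocks, reference_io)
for name, (err, ok) in report.items():
    print(f"{name}: max |diff| = {err:.2e} ({'pass' if ok else 'FAIL'})")
```

Checking each block in isolation against recorded reference signals makes it easier to localize a discrepancy than comparing only the final output of the full model.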
