Paper Title


Improved acoustic-to-articulatory inversion using representations from pretrained self-supervised learning models

Authors

Sathvik Udupa, Siddarth C, Prasanta Kumar Ghosh

Abstract


In this work, we investigate the effectiveness of pretrained Self-Supervised Learning (SSL) features for learning the acoustic-to-articulatory inversion (AAI) mapping. Signal-processing-based acoustic features such as MFCCs have predominantly been used for the AAI task with deep neural networks. Since SSL features work well for various other speech tasks, such as speech recognition and emotion classification, we experiment with their efficacy for AAI. We train transformer neural network-based AAI models of 3 different complexities on SSL features and compare their performance with MFCCs in subject-specific (SS), pooled, and fine-tuned (FT) configurations with data from 10 subjects, evaluating with the correlation coefficient (CC) score on an unseen-sentence test set. We find that SSL features trained with acoustic-feature-reconstruction objectives, such as TERA and DeCoAR, work well for AAI, with the SS CCs of these SSL features reaching close to the best FT CCs of MFCCs. We also find the results consistent across different model sizes.
