Paper Title
Semi-Supervised Learning with Data Augmentation for End-to-End ASR
Paper Authors
Paper Abstract
In this paper, we apply Semi-Supervised Learning (SSL) along with Data Augmentation (DA) to improve the accuracy of end-to-end ASR. We focus on the consistency regularization principle, which has been successfully applied to image classification tasks, and present sequence-to-sequence (seq2seq) versions of the FixMatch and Noisy Student algorithms. Specifically, we generate pseudo labels for the unlabeled data on-the-fly with a seq2seq model after perturbing the input features with DA. We also propose soft-label variants of both algorithms to cope with pseudo label errors, showing further performance improvements. We conduct SSL experiments on a conversational speech dataset with 1.9kh of manually transcribed training data, using only 25% of the original labels (475h labeled data). As a result, the Noisy Student algorithm with soft labels and consistency regularization achieves a 10.4% word error rate (WER) reduction when adding 475h of unlabeled data, corresponding to a recovery rate of 92%. Furthermore, when iteratively adding 950h more unlabeled data, our best SSL performance is within a 5% WER increase compared to using the full labeled training set (recovery rate: 78%).
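The pseudo-labeling step described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the decoder, the augmentation functions, and the confidence threshold are all hypothetical placeholders. The key idea it shows is FixMatch-style consistency: decode a weakly augmented input to obtain a pseudo label, and fall back to soft (probability-weighted) labels when confidence is low, to tolerate pseudo-label errors.

```python
# Hypothetical stand-ins (not from the paper's code): a toy "seq2seq decoder"
# that returns a token sequence with per-token posteriors, plus lightweight
# placeholders for weak/strong input perturbations (e.g. SpecAugment-like DA).

def weak_augment(features):
    # Light perturbation of the input features (placeholder: small offset).
    return [f + 0.01 for f in features]

def strong_augment(features):
    # Heavier perturbation used for the student's training pass
    # (placeholder: mask the first frame).
    out = list(features)
    out[0] = 0.0
    return out

def seq2seq_decode(features):
    # Placeholder decoder: returns (tokens, per-token confidences).
    tokens = ["hello", "world"]
    confs = [0.95, 0.80]
    return tokens, confs

def make_pseudo_labels(unlabeled_batch, threshold=0.9):
    """FixMatch-style pseudo labeling: decode the weakly augmented input,
    keep a hard pseudo label only if every token clears the confidence
    threshold; otherwise keep soft labels to absorb pseudo-label errors.
    The student is then trained on strong_augment(features) against these
    targets, which is the consistency-regularization signal."""
    pseudo = []
    for feats in unlabeled_batch:
        tokens, confs = seq2seq_decode(weak_augment(feats))
        if min(confs) >= threshold:
            pseudo.append(("hard", tokens))
        else:
            pseudo.append(("soft", list(zip(tokens, confs))))
    return pseudo
```

In a real system the soft branch would carry full posterior distributions per token and feed a KL-divergence-style loss, while the hard branch reduces to standard cross-entropy on the decoded sequence.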