Paper Title
A CTC Triggered Siamese Network with Spatial-Temporal Dropout for Speech Recognition
Paper Authors
Paper Abstract
Siamese networks have shown effective results in unsupervised visual representation learning. These models are designed to learn a representation that is invariant across two augmentations of one input by maximizing their similarity. In this paper, we propose an effective Siamese network to improve the robustness of end-to-end automatic speech recognition (ASR). We introduce spatial-temporal dropout to support stronger perturbations in the Siamese-ASR framework. In addition, we relax the similarity regularization to maximize the similarity of the output distributions only on the frames where connectionist temporal classification (CTC) spikes occur, rather than on all frames. The efficiency of the proposed architecture is evaluated on two benchmarks, AISHELL-1 and Librispeech, yielding 7.13% relative character error rate (CER) and 6.59% relative word error rate (WER) reductions, respectively. Analysis shows that our proposed approach brings better uniformity to the trained model and noticeably enlarges the CTC spikes.
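The two ideas named in the abstract can be sketched in a few lines. The snippet below is a hedged illustration, not the authors' implementation: `spatial_temporal_dropout` drops whole time steps and whole feature channels of an acoustic feature matrix, and `spike_masked_similarity` restricts a frame-wise similarity loss to the frames where one branch's CTC output spikes (argmax label is not blank). All function names, the blank index, and the cosine-based loss are illustrative assumptions.

```python
import numpy as np

def spatial_temporal_dropout(x, p_t=0.1, p_f=0.1, rng=None):
    """Hedged sketch: zero out entire time steps (rows) and entire
    feature channels (columns) of a (T, F) feature matrix."""
    rng = np.random.default_rng() if rng is None else rng
    keep_t = rng.random(x.shape[0]) >= p_t   # keep-mask over time steps
    keep_f = rng.random(x.shape[1]) >= p_f   # keep-mask over channels
    return x * keep_t[:, None] * keep_f[None, :]

def spike_masked_similarity(p_a, p_b, blank=0):
    """Hedged sketch of the relaxed regularizer: compare the two
    branches' (T, V) output distributions only on frames where
    branch A spikes, i.e. its argmax label is not the CTC blank.
    Returns 1 - mean frame-wise cosine similarity on those frames."""
    spikes = p_a.argmax(axis=-1) != blank    # (T,) boolean spike mask
    if not spikes.any():
        return 0.0                           # no spike frames: no loss
    a, b = p_a[spikes], p_b[spikes]
    cos = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return float(1.0 - cos.mean())
```

In this sketch, identical branch outputs give zero loss, and disagreement on spike frames increases it; blank-dominated frames contribute nothing, which mirrors the abstract's point that the constraint is applied only where CTC spikes occur.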