Paper Title


Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition

Authors

Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen

Abstract


Non-autoregressive transformer models have achieved extremely fast inference and performance comparable to autoregressive sequence-to-sequence models in neural machine translation. Most non-autoregressive transformers decode the target sequence from a mask sequence of predefined length. If the predefined length is too long, it causes many redundant computations; if it is shorter than the target sequence, it hurts the model's performance. To address this problem and improve inference speed, we propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate convergence. All experiments are conducted on the public Chinese Mandarin dataset AISHELL-1. The results show that the proposed model can accurately predict the length of the target sequence and achieves performance competitive with advanced transformer models. Moreover, the model reaches a real-time factor of 0.0056, exceeding all mainstream speech recognition models.
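The core idea, predicting the target length from CTC spikes, can be sketched as follows. A well-trained CTC model produces sharply peaked ("spiky") frame posteriors, so counting the frames whose non-blank probability crosses a threshold approximates the number of output tokens. This is a minimal illustrative sketch, not the paper's implementation: the function name, the fixed threshold of 0.5, and the use of NumPy are assumptions, and the actual model additionally uses the spike positions to gate encoder states for the decoder.

```python
import numpy as np

def estimate_target_length(ctc_log_probs, blank_id=0, threshold=0.5):
    """Estimate the target-sequence length by counting CTC spikes.

    ctc_log_probs: array of shape (T, vocab_size) holding per-frame
    log-posteriors from a CTC output layer. A frame "triggers" a token
    when its total non-blank probability exceeds `threshold` (an
    illustrative choice, not a value from the paper).
    """
    probs = np.exp(ctc_log_probs)          # (T, vocab) frame posteriors
    non_blank = 1.0 - probs[:, blank_id]   # probability mass off the blank
    spikes = non_blank > threshold         # frames that trigger a token
    return int(spikes.sum())
```

With spiky posteriors this count tracks the true token count closely, which is what lets the model skip the predefined-length mask sequence used by earlier non-autoregressive transformers.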
