Title

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

Authors

Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

Abstract

For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while keeping the requirement of computational resources low. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network architecture by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% $\rightarrow$ 9.1% WER on WSJ). Moreover, Mask-CTC now achieves results competitive with AR models with no degradation in inference speed ($<$ 0.1 RTF using a CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.
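
The abstract above refers to the general mask-predict refinement scheme that Mask-CTC builds on: a CTC pass produces an initial hypothesis, low-confidence tokens are masked, and a conditional masked language model refills the masked positions in parallel over a few iterations. The toy sketch below illustrates only that generic decoding loop; `fill_masks`, the confidence threshold, and the easiest-first commit rule are illustrative assumptions, not the authors' implementation, and the paper's dynamic length prediction (which lets the hypothesis grow or shrink by inserting or deleting tokens) is not modeled here.

```python
"""Toy sketch of a Mask-CTC-style decoding loop (illustration only, not the paper's code)."""

MASK = "<mask>"


def mask_low_confidence(tokens, confidences, threshold=0.9):
    # Keep tokens the CTC pass is confident about; mask the rest.
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]


def mask_ctc_decode(tokens, confidences, fill_masks, iterations=2, threshold=0.9):
    hyp = mask_low_confidence(tokens, confidences, threshold)
    for _ in range(iterations):
        masked = [i for i, t in enumerate(hyp) if t == MASK]
        if not masked:
            break
        # fill_masks stands in for the conditional masked LM; it returns
        # {position: (token, score)} for every masked slot in parallel.
        preds = fill_masks(hyp, masked)
        # Easiest-first: commit the most confident half, leave the rest masked.
        ranked = sorted(masked, key=lambda i: -preds[i][1])
        for i in ranked[: max(1, len(ranked) // 2)]:
            hyp[i] = preds[i][0]
    # Final pass: fill anything still masked.
    masked = [i for i, t in enumerate(hyp) if t == MASK]
    if masked:
        preds = fill_masks(hyp, masked)
        for i in masked:
            hyp[i] = preds[i][0]
    return hyp


if __name__ == "__main__":
    # Dummy "masked LM" for the demo: always guesses the letter "e".
    def fill_masks(hyp, masked):
        return {i: ("e", 0.5) for i in masked}

    tokens = list("improved")
    confs = [0.99, 0.4, 0.99, 0.99, 0.3, 0.99, 0.99, 0.99]
    print(mask_ctc_decode(tokens, confs, fill_masks))
```

Because all masked positions are re-predicted in parallel, the number of decoder passes is a small constant rather than proportional to the output length, which is where the speed advantage over autoregressive decoding comes from.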
