Paper Title
FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
Paper Authors
Paper Abstract
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting words quickly without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches, including Early and Late Penalties and Constrained Alignments, penalize emission delay by manipulating per-token or per-frame probability predictions in sequence transducer models. While successful in reducing delay, these approaches suffer from significant accuracy regression and also require additional word alignment information from an existing model. In this work, we propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly to the per-sequence probability when training transducer models, and requires no alignment. We demonstrate that FastEmit is better suited to sequence-level optimization of transducer models for streaming ASR by applying it to a variety of end-to-end streaming networks, including RNN-Transducer, Transformer-Transducer, ConvNet-Transducer, and Conformer-Transducer. We achieve 150-300 ms latency reduction with significantly better accuracy than previous techniques on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th-percentile latency from 210 ms to only 30 ms on LibriSpeech.
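In practice, FastEmit is described as a gradient-level modification to the transducer loss: the gradients flowing to the label (non-blank) emission probabilities at each lattice node are scaled by (1 + λ), while the blank gradients are left untouched, biasing training toward earlier emissions without any frame alignment. A minimal sketch of that scaling step, where the function name, the λ default, and the (T, U) gradient-array layout are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def fastemit_regularize(grad_label, grad_blank, lam=0.01):
    """FastEmit-style sequence-level emission regularization (sketch).

    grad_label: gradients of the transducer loss w.r.t. the label
                (non-blank) emission probabilities at each (T, U)
                lattice node.
    grad_blank: gradients w.r.t. the blank probabilities at the same
                nodes.
    lam:        regularization weight lambda; lam = 0 recovers the
                unregularized transducer loss.

    Scaling only the label-emission gradients by (1 + lam) rewards
    emitting a token now over deferring it via blank, with no word
    alignment supervision required.
    """
    return (1.0 + lam) * grad_label, grad_blank

# Toy usage on random gradients over a small 4 x 3 lattice.
rng = np.random.default_rng(0)
g_label = rng.standard_normal((4, 3))
g_blank = rng.standard_normal((4, 3))
g_label_fe, g_blank_fe = fastemit_regularize(g_label, g_blank, lam=0.1)
```

Because the change is a per-node gradient rescaling, it slots into an existing transducer forward-backward pass without altering the decoder or adding alignment targets, which is what lets it apply uniformly to the RNN-, Transformer-, ConvNet-, and Conformer-Transducer variants mentioned above.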