Paper Title
Towards Fast and Accurate Streaming End-to-End ASR
Paper Authors
Paper Abstract
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition system into one neural network with far fewer parameters than a conventional ASR system, making them suitable for on-device applications. For example, the recurrent neural network transducer (RNN-T), as a streaming E2E model, has shown promising potential for on-device ASR. For such applications, quality and latency are two critical factors. We propose to reduce the E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique, we achieved an 8.0% relative word error rate (WER) reduction and a 130ms 90-percentile latency reduction on a Voice Search test set. We also experimented with a second-pass Listen, Attend and Spell (LAS) rescorer. Although it did not directly improve the first-pass latency, the large WER reduction provides extra room to trade WER for latency. RNN-T EP+LAS, together with MWER training, brings an 18.7% relative WER reduction and a 160ms 90-percentile latency reduction compared to the originally proposed RNN-T EP model.
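The latency gain described above comes from penalizing the model for predicting the end-of-query token too early or too late relative to the reference endpoint. Below is a minimal illustrative sketch of that idea, not the paper's implementation: the function name, the `eos_id` / `t_eos` arguments, the buffer sizes, and the penalty weights are all assumptions introduced here for clarity.

```python
# Illustrative sketch: applying early/late endpointing penalties to the
# end-of-query token's log-probabilities before an RNN-T-style loss.
import numpy as np

def apply_endpointer_penalties(log_probs, eos_id, t_eos,
                               early_buffer=10, late_buffer=10,
                               alpha_early=1.0, alpha_late=1.0):
    """Penalize emitting </s> too early or too late.

    log_probs: [T, U, V] per-frame, per-label-position output log-probabilities.
    eos_id:    vocabulary index of the end-of-query token </s>.
    t_eos:     reference frame index where the utterance actually ends.
    """
    penalized = log_probs.copy()
    T = log_probs.shape[0]
    for t in range(T):
        if t < t_eos - early_buffer:
            # Emitting </s> this early is discouraged, more so the earlier it is.
            penalized[t, :, eos_id] -= alpha_early * (t_eos - early_buffer - t)
        elif t > t_eos + late_buffer:
            # Emitting </s> this late is discouraged, more so the later it is.
            penalized[t, :, eos_id] -= alpha_late * (t - t_eos - late_buffer)
    return penalized

# Toy usage: random log-probabilities for a 50-frame, 5-position, 30-token
# output grid, with the reference endpoint at frame 40 and </s> as token 29.
log_probs = np.log(np.random.dirichlet(np.ones(30), size=(50, 5)))
penalized = apply_endpointer_penalties(log_probs, eos_id=29, t_eos=40)
```

The design intent, under these assumptions, is that the penalized log-probabilities feed into the usual transducer loss, so the model is trained to emit the end-of-query token close to the true endpoint rather than waiting, which is what trades a small amount of WER for lower endpointing latency.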