Paper Title
Towards Fast and Accurate Streaming End-to-End ASR
Paper Authors
Paper Abstract
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition system into one neural network with far fewer parameters than a conventional ASR system, making them suitable for on-device applications. For example, the recurrent neural network transducer (RNN-T), as a streaming E2E model, has shown promising potential for on-device ASR. For such applications, quality and latency are two critical factors. We propose to reduce the E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique, we achieved an 8.0% relative word error rate (WER) reduction and a 130ms 90-percentile latency reduction on a Voice Search test set. We also experimented with a second-pass Listen, Attend and Spell (LAS) rescorer. Although it did not directly improve the first-pass latency, the large WER reduction provides extra room to trade WER for latency. RNN-T EP+LAS, together with MWER training, brings an 18.7% relative WER reduction and a 160ms 90-percentile latency reduction compared to the originally proposed RNN-T EP model.
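The latency gain described above comes from penalizing the model for predicting the end-of-query token too early or too late relative to the reference endpoint. Below is a minimal illustrative sketch of that idea, not the paper's implementation: the function name, the `eos_id` / `t_eos` arguments, the buffer sizes, and the penalty weights are all assumptions introduced here for clarity.

```python
# Illustrative sketch: applying early/late endpointing penalties to the
# end-of-query token's log-probabilities before an RNN-T-style loss.
import numpy as np

def apply_endpointer_penalties(log_probs, eos_id, t_eos,
                               early_buffer=10, late_buffer=10,
                               alpha_early=1.0, alpha_late=1.0):
    """Penalize emitting </s> too early or too late.

    log_probs: [T, U, V] per-frame, per-label-position output log-probabilities.
    eos_id:    vocabulary index of the end-of-query token </s>.
    t_eos:     reference frame index where the utterance actually ends.
    """
    penalized = log_probs.copy()
    T = log_probs.shape[0]
    for t in range(T):
        if t < t_eos - early_buffer:
            # Emitting </s> this early is discouraged, more so the earlier it is.
            penalized[t, :, eos_id] -= alpha_early * (t_eos - early_buffer - t)
        elif t > t_eos + late_buffer:
            # Emitting </s> this late is discouraged, more so the later it is.
            penalized[t, :, eos_id] -= alpha_late * (t - t_eos - late_buffer)
    return penalized

# Toy usage: random log-probabilities for a 50-frame, 5-position, 30-token
# output grid, with the reference endpoint at frame 40 and </s> as token 29.
log_probs = np.log(np.random.dirichlet(np.ones(30), size=(50, 5)))
penalized = apply_endpointer_penalties(log_probs, eos_id=29, t_eos=40)
```

The design intent, under these assumptions, is that the penalized log-probabilities feed into the usual transducer loss, so the model is trained to emit the end-of-query token close to the true endpoint rather than waiting, which is what trades a small amount of WER for lower endpointing latency.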