Paper Title

Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames

Authors

Chengdong Liang, Xiao-Lei Zhang, Binbin Zhang, Di Wu, Shengqiang Li, Xingchen Song, Zhendong Peng, Fuping Pan

Abstract

Recently, the unified streaming and non-streaming two-pass (U2/U2++) end-to-end model for speech recognition has shown great performance in terms of streaming capability, accuracy, and latency. In this paper, we present Fast-U2++, an enhanced version of U2++ that further reduces partial latency. The core idea of Fast-U2++ is to output partial results from the bottom layers of its encoder with a small chunk, while using a large chunk in the top layers of its encoder to compensate for the performance degradation caused by the small chunk. Moreover, we use a knowledge distillation method to reduce the token emission latency. We present extensive experiments on the AISHELL-1 dataset. Experiments and ablation studies show that, compared to U2++, Fast-U2++ reduces model latency from 320 ms to 80 ms and achieves a character error rate (CER) of 5.06% with a streaming setup.
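The dual-chunk idea from the abstract can be sketched as follows. This is a minimal illustrative simplification, not the paper's implementation: `bottom_layer` and `top_layer` are hypothetical stand-ins for encoder layers, and the chunk sizes are placeholders. The point is only the data flow: bottom layers consume small chunks so a partial result can be emitted early, while top layers re-process the accumulated bottom-layer output in large chunks to recover accuracy.

```python
def chunked(frames, chunk_size):
    """Split a frame sequence into fixed-size chunks (last chunk may be shorter)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]

def bottom_layer(chunk):
    # Hypothetical stand-in for a bottom encoder layer: tags each frame.
    return [("bot", f) for f in chunk]

def top_layer(chunk):
    # Hypothetical stand-in for a top encoder layer.
    return [("top", f) for f in chunk]

def dual_chunk_encoder(frames, small=2, large=8):
    """Sketch of the Fast-U2++ encoder data flow (simplified assumption).

    Bottom layers run on small chunks, emitting a partial result after each
    one (low partial latency); top layers then run on large chunks over the
    accumulated bottom-layer output (compensating the small-chunk accuracy loss).
    """
    partial_outputs = []
    processed = []
    for small_chunk in chunked(frames, small):
        processed.extend(bottom_layer(small_chunk))
        partial_outputs.append(list(processed))  # early partial result per small chunk
    final = []
    for large_chunk in chunked(processed, large):
        final.extend(top_layer(large_chunk))
    return partial_outputs, final
```

With 10 input frames and `small=2`, a partial result is emitted five times before the large-chunk top layers produce the final encoding, which is the latency/accuracy trade-off the abstract describes.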
