Streaming, fast and accurate on-device Inverse Text Normalization for Automatic Speech Recognition

Paper Title

Streaming, fast and accurate on-device Inverse Text Normalization for Automatic Speech Recognition

Authors

Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphanso, Jinyu Li, Yifan Gong

Abstract

Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer written-form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous work, Weighted Finite State Transducers (WFSTs) have been employed to do ITN. WFSTs are well suited to this task, but their size and run-time costs can make deployment in embedded applications challenging. In this paper, we describe the development of an on-device ITN system that is streaming, lightweight, and accurate. At the core of our system is a streaming transformer tagger that tags lexical tokens from ASR. The tag indicates which ITN category, if any, might apply. Following that, we apply an ITN-category-specific WFST, only on the tagged text, to reliably perform the ITN conversion. We show that the proposed ITN solution performs on par with strong baselines, while being significantly smaller in size and retaining customization capabilities.
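
The two-stage design described in the abstract (tag the lexical tokens first, then run a category-specific converter only on the tagged spans) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `CARDINAL` tag, the rule-based `toy_tagger` that stands in for the streaming transformer tagger, and the `cardinal_to_written` function that stands in for a category-specific WFST are all hypothetical and not taken from the paper. The sketch also processes a complete utterance at once rather than streaming.

```python
from typing import Callable, Dict, List

O = "O"  # hypothetical tag meaning "no ITN needed"

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: (i + 2) * 10 for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}


def toy_tagger(tokens: List[str]) -> List[str]:
    """Rule-based stand-in for the tagger: marks number words as CARDINAL."""
    return [
        "CARDINAL" if t in UNITS or t in TENS or t == "hundred" else O
        for t in tokens
    ]


def cardinal_to_written(span: List[str]) -> str:
    """Stand-in for a category-specific WFST.

    Handles only simple phrases below one thousand,
    such as "two hundred thirty five".
    """
    total = 0
    for word in span:
        if word == "hundred":
            total = max(total, 1) * 100
        else:
            total += UNITS.get(word, TENS.get(word, 0))
    return str(total)


# One converter per ITN category (hypothetical registry).
CONVERTERS: Dict[str, Callable[[List[str]], str]] = {
    "CARDINAL": cardinal_to_written,
}


def apply_itn(tokens: List[str]) -> str:
    """Tag tokens, then convert each run of identically tagged tokens."""
    tags = toy_tagger(tokens)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag == O:
            out.append(tokens[i])  # untagged text passes through unchanged
            i += 1
            continue
        j = i
        while j < len(tokens) and tags[j] == tag:
            j += 1
        out.append(CONVERTERS[tag](tokens[i:j]))  # converter sees only the span
        i = j
    return " ".join(out)


print(apply_itn("i have two hundred thirty five apples".split()))
```

Because each converter only ever sees the span its tag selects, untagged text is never rewritten, and a category could in principle be customized by swapping its entry in `CONVERTERS`. This mirrors the customization property the abstract mentions, though the paper's actual mechanism may differ.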
