Paper Title

SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition

Paper Authors

Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu

Paper Abstract

Error correction in automatic speech recognition (ASR) aims to correct the incorrect words in sentences generated by ASR models. Since recent ASR models usually have a low word error rate (WER), error correction models should modify only the incorrect words so as not to affect originally correct tokens; detecting incorrect words is therefore important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide a clear signal about which tokens are incorrect, and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection. Specifically, we first detect whether a token is correct or not through a probability produced by a specially designed language model, and then design a constrained CTC loss that duplicates only the detected incorrect tokens, letting the decoder focus on the correction of error tokens. Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect and thus does not need to duplicate every token, but only incorrect ones; compared with explicit error detection, SoftCorrect does not detect specific deletion/substitution/insertion errors but simply leaves this to the CTC loss. Experiments on the AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming previous works by a large margin, while still enjoying the fast speed of parallel generation.
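To make the mechanism concrete, below is a minimal PyTorch-style sketch of the two ideas the abstract describes: thresholding a language model's per-token correctness probability (soft detection), and duplicating only the flagged tokens' encoder states so a CTC decoder gets room to delete, substitute, or insert at those positions. The function names (`soft_detect`, `duplicate_incorrect`), the threshold `tau`, and the duplication factor of 2 are illustrative assumptions, not details taken from the paper.

```python
import torch

def soft_detect(token_probs: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Flag tokens whose LM-assigned correctness probability falls below tau.

    Hypothetical interface: token_probs holds one correctness probability
    per input token; tau is an assumed decision threshold.
    """
    return token_probs < tau  # bool mask, shape (seq_len,)

def duplicate_incorrect(hidden: torch.Tensor, incorrect: torch.Tensor) -> torch.Tensor:
    """Duplicate only the encoder states of tokens flagged as incorrect.

    Correct tokens keep a single state (the decoder is expected to copy
    them); each incorrect token gets two states (an assumed factor), so
    CTC can effectively delete, substitute, or insert at that position
    without explicit error-type labels.
    """
    repeats = incorrect.long() + 1          # 1 for correct, 2 for incorrect
    return hidden.repeat_interleave(repeats, dim=0)

# Toy usage: 5 tokens with hidden size 4; tokens 1 and 3 look suspicious.
hidden = torch.randn(5, 4)
token_probs = torch.tensor([0.95, 0.30, 0.90, 0.10, 0.85])
mask = soft_detect(token_probs, tau=0.5)
expanded = duplicate_incorrect(hidden, mask)
print(mask.tolist())    # [False, True, False, True, False]
print(expanded.shape)   # torch.Size([7, 4]) -- two extra duplicated states
```

In this toy run, five encoder states expand to seven, adding slack only at the suspected error positions; duplicating every token (as plain CTC-based correction would) would double the sequence length while this constrained variant keeps correct tokens as single, trivially copyable states.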
