Paper Title
Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR
Paper Authors
Paper Abstract
Accurate prediction of the user's intent to interact with a voice assistant (VA) on a device (e.g., on a phone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach that predicts the user's intent (whether the user is speaking to the device or not) directly from acoustic and textual information encoded at subword tokens obtained via an end-to-end ASR model. Modeling the subword tokens directly, compared to modeling phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation in which each token has a semantic meaning, in contrast to phoneme-level representations, and (ii) each subword token has a reusable "sub"-word acoustic pattern (that can be used to construct multiple full words), resulting in a vocabulary space much smaller than that of full words. To learn the subword representations for audio-to-intent classification, we extract: (i) acoustic information from an E2E-ASR model, which provides frame-level CTC posterior probabilities for the subword tokens, and (ii) textual information from a pre-trained continuous bag-of-words model that captures the semantic meaning of the subword tokens. The key to our approach is the way it combines acoustic subword-level posteriors with textual information using the notion of positional encoding, in order to account for multiple ASR hypotheses simultaneously. We show that our approach provides more robust and richer representations for audio-to-intent classification and is highly accurate, correctly mitigating 93.3% of unintended user audio from invoking the smart assistant at a 99% true positive rate.
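The abstract only sketches the acoustic-textual fusion at a high level. The snippet below is a minimal illustrative sketch, not the paper's actual architecture: it assumes frame-level CTC posteriors over the subword vocabulary and pre-trained CBOW subword embeddings, fuses them by taking the posterior-weighted average embedding per frame (a soft mixture over competing ASR hypotheses), and adds sinusoidal positional encodings; all function names, tensor shapes, and the weighted-average fusion itself are assumptions for illustration only.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encodings, one dim-sized vector per frame."""
    positions = np.arange(num_positions)[:, None]                    # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

def fuse_acoustic_textual(ctc_posteriors: np.ndarray,
                          cbow_embeddings: np.ndarray) -> np.ndarray:
    """
    ctc_posteriors:  (T, V) frame-level CTC posterior probabilities over the
                     subword vocabulary, as produced by an E2E-ASR model.
    cbow_embeddings: (V, D) pre-trained continuous bag-of-words embeddings,
                     one D-dimensional vector per subword token.
    Returns a (T, D) sequence where each frame's textual embedding is the
    posterior-weighted average over all subword tokens, with positional
    encodings added so a downstream intent classifier keeps frame order.
    """
    T, _ = ctc_posteriors.shape
    _, D = cbow_embeddings.shape
    frame_text = ctc_posteriors @ cbow_embeddings                    # (T, D)
    return frame_text + sinusoidal_positional_encoding(T, D)

# Toy usage with made-up sizes: 50 frames, 1000 subword tokens, 64-dim embeddings.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(1000), size=50)   # each row sums to 1
emb = rng.normal(size=(1000, 64))
features = fuse_acoustic_textual(post, emb)    # (50, 64) fused representation
print(features.shape)
```

In this reading, weighting the embeddings by the full posterior (rather than the top-1 token) is what lets a single representation reflect multiple ASR hypotheses at once; the paper's exact combination mechanism may differ.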