Paper Title

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Paper Authors

Huang, W. Ronny, Peyser, Cal, Sainath, Tara N., Pang, Ruoming, Strohman, Trevor, Kumar, Shankar

Paper Abstract

Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain-mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art, production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.
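The abstract names three selection strategies: soft-log downsampling of head-heavy sentences, filtering for words that are rare in the acoustic data, and perplexity-based contrastive selection for domain match. The sketch below illustrates one plausible reading of each in Python. The exact soft-log formula, the `cap` and `max_count` thresholds, and the `logprob` LM interface are assumptions for illustration, not the paper's implementation; the contrastive step follows the standard Moore-Lewis cross-entropy-difference recipe, which is what "perplexity-based contrastive selection" suggests.

```python
import math


def soft_log_downsample(sentence_counts, cap=100):
    """Flatten head-heavy sentence counts with a soft log cap.

    ASSUMED form of the paper's "soft log function": counts at or
    below `cap` are kept as-is; above it, counts grow only
    logarithmically, tunably reducing high-frequency (head) queries
    such as "weather" while leaving the tail untouched.
    """
    kept = {}
    for sentence, count in sentence_counts.items():
        if count <= cap:
            kept[sentence] = count
        else:
            kept[sentence] = int(round(cap * (1.0 + math.log(count / cap))))
    return kept


def rare_word_filter(sentences, acoustic_word_counts, max_count=5):
    """Keep only sentences containing at least one word that is rare
    in the acoustic training transcripts (count <= max_count; the
    threshold is a hypothetical choice)."""
    def has_rare_word(sentence):
        return any(acoustic_word_counts.get(w, 0) <= max_count
                   for w in sentence.split())
    return [s for s in sentences if has_rare_word(s)]


def contrastive_select(sentences, in_domain_lm, out_domain_lm,
                       keep_frac=0.25):
    """Moore-Lewis-style contrastive selection: rank sentences by the
    per-word log-probability difference between an in-domain LM and a
    generic (out-of-domain) LM, then keep the best-matching fraction.
    Both LMs are assumed to expose a logprob(sentence) method that
    returns the total log-probability of the sentence."""
    def score(sentence):
        n_words = max(len(sentence.split()), 1)
        return (in_domain_lm.logprob(sentence)
                - out_domain_lm.logprob(sentence)) / n_words
    ranked = sorted(sentences, key=score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]
```

Applied in sequence (downsample, then rare-word filter, then contrastive selection), a pipeline of this shape could shrink a raw query log by a large factor, consistent with the 53x down-selection the abstract reports, though the ordering and thresholds here are illustrative only.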
