Paper Title


SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

Authors

Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-yi Lee, Karen Livescu, Shinji Watanabe

Abstract

Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
