Building and curating conversational corpora for diversity-aware language science and technology

Paper title

Building and curating conversational corpora for diversity-aware language science and technology

Authors

Andreas Liesenfeld, Mark Dingemanse

Abstract

We present an analysis pipeline and best practice guidelines for building and curating corpora of everyday conversation in diverse languages. Surveying language documentation corpora and other resources that cover 67 languages and varieties from 28 phyla, we describe the compilation and curation process, specify minimal properties of a unified format for interactional data, and develop methods for quality control that take into account turn-taking and timing. Two case studies show the broad utility of conversational data for (i) charting human interactional infrastructure and (ii) tracing challenges and opportunities for current ASR solutions. Linguistically diverse conversational corpora can provide new insights for the language sciences and stronger empirical foundations for language technology.
