Paper title

Benchmarking zero-shot and few-shot approaches for tokenization, tagging, and dependency parsing of Tagalog text

Authors

Aquino, Angelina; de Leon, Franz

Abstract

The grammatical analysis of texts in any written language typically involves a number of basic processing tasks, such as tokenization, morphological tagging, and dependency parsing. State-of-the-art systems can achieve high accuracy on these tasks for languages with large datasets, but yield poor results for languages which have little to no annotated data. To address this issue for the Tagalog language, we investigate the use of alternative language resources for creating task-specific models in the absence of dependency-annotated Tagalog data. We also explore the use of word embeddings and data augmentation to improve performance when only a small amount of annotated Tagalog data is available. We show that these zero-shot and few-shot approaches yield substantial improvements on grammatical analysis of both in-domain and out-of-domain Tagalog text compared to state-of-the-art supervised baselines.
