Paper Title

A Multitask Learning Approach for Diacritic Restoration

Authors

Sawsan Alqahtani, Ajay Mishra, Mona Diab

Abstract

In many languages, such as Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, which increases the number of possible pronunciations and meanings for a word. The resulting text is more ambiguous, which makes computational processing of it more difficult. Diacritic restoration is the task of restoring missing diacritics in written text. Most state-of-the-art diacritic restoration models are built on character-level information, which helps the model generalize to unseen data but presumably loses useful information at the word level. To compensate for this loss, we investigate the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for the tasks that we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to state-of-the-art models that are more complex, relying on morphological analyzers and/or much more data (e.g., dialectal data).
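To make the task definition concrete, the following minimal Python sketch shows how a diacritized word can be split into base characters and per-character diacritic labels, which is one common way to frame character-level restoration. The function names (`split_diacritics`, `strip_diacritics`, `restore`) are hypothetical and chosen for illustration only; this is not the authors' preprocessing code, and it assumes that diacritics are encoded as Unicode combining marks following their base letter.

```python
import unicodedata


def split_diacritics(text):
    """Split text into (base_char, diacritics) pairs.

    Assumption: each diacritic is a Unicode combining mark that follows
    the base character it modifies. A combining mark with no preceding
    base character is kept as its own base.
    """
    pairs = []
    for ch in text:
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def strip_diacritics(text):
    """Return the undiacritized form, i.e. the input a model would see."""
    return "".join(base for base, _ in split_diacritics(text))


def restore(bases, labels):
    """Reattach one (possibly empty) diacritic label to each base character."""
    return "".join(base + label for base, label in zip(bases, labels))


# The word "kataba" written with a fatha (U+064E) after each letter.
word = "\u0643\u064e\u062a\u064e\u0628\u064e"
pairs = split_diacritics(word)
bare = strip_diacritics(word)
rebuilt = restore([b for b, _ in pairs], [m for _, m in pairs])
```

Here `bare` plays the role of the undiacritized written text, and the second element of each pair is the label a character-level model would be trained to predict for that position.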
