Paper title
Mukayese: Turkish NLP Strikes Back
Paper authors
Paper abstract
Having sufficient resources for language X lifts it from the under-resourced languages class, but not necessarily from the under-researched class. In this paper, we address the problem of the absence of organized benchmarks in the Turkish language. We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications. As a solution, we present Mukayese, a set of NLP benchmarks for the Turkish language that contains several NLP tasks. We work on one or more datasets for each benchmark and present two or more baselines. Moreover, we present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking. All datasets and baselines are available under: https://github.com/alisafaya/mukayese