Automatic Standardization of Colloquial Persian

Paper Title

Automatic Standardization of Colloquial Persian

Authors

Mohammad Sadegh Rasooli, Farzane Bakhtyari, Fatemeh Shafiei, Mahsa Ravanbakhsh, Chris Callison-Burch

Abstract

The Iranian Persian language has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that the text is in standard form; this assumption is wrong in many real applications, especially web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for learning a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation dataset consisting of 1912 sentences from a diverse set of domains. Our intrinsic evaluation shows a BLEU score of 62.8 for our model, versus 61.7 for an off-the-shelf rule-based standardization model; the original, unstandardized text has a BLEU score of 46.4. We also show that our model improves English-to-Persian machine translation in scenarios where the training data is in colloquial Persian, with an absolute BLEU score difference of 1.4 on the development data and 0.8 on the test data.
