Paper title
Automatic Standardization of Colloquial Persian
Paper authors
Paper abstract
Iranian Persian has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that text is in standard form, an assumption that fails in many real applications, especially web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for training a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation dataset consisting of 1912 sentences from a diverse set of domains. In our intrinsic evaluation, our model achieves a BLEU score of 62.8, compared with 61.7 for an off-the-shelf rule-based standardization model and 46.4 for the original, unstandardized text. We also show that our model improves English-to-Persian machine translation in scenarios where the training data is colloquial Persian, with an absolute BLEU improvement of 1.4 on the development data and 0.8 on the test data.
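The abstract does not spell out how the artificial parallel data is generated. As a rough illustration of the general idea only, the sketch below rewrites standard Persian sentences into a colloquial-looking form with a small word-substitution table and pairs each result with its standard source. The table `COLLOQUIAL_FORMS`, the functions `colloquialize` and `make_parallel_pairs`, and the probability parameter `p` are all hypothetical names chosen for this example; this is not the paper's algorithm.

```python
import random

# Hypothetical standard -> colloquial word substitutions, for illustration
# only. This is NOT the rule set or the algorithm used in the paper.
COLLOQUIAL_FORMS = {
    "خانه": "خونه",  # "house"
    "نان": "نون",    # "bread"
    "آن": "اون",     # "that"
    "را": "رو",      # object marker
}


def colloquialize(sentence, rng, p=1.0):
    """Replace each known standard token with its colloquial form with probability p."""
    out = []
    for token in sentence.split():
        if token in COLLOQUIAL_FORMS and rng.random() < p:
            out.append(COLLOQUIAL_FORMS[token])
        else:
            out.append(token)
    return " ".join(out)


def make_parallel_pairs(standard_sentences, seed=0, p=1.0):
    """Return (artificial colloquial source, standard target) pairs."""
    rng = random.Random(seed)
    return [(colloquialize(s, rng, p), s) for s in standard_sentences]


pairs = make_parallel_pairs(["من نان را خوردم"])  # "I ate the bread"
```

Pairs of this shape (colloquial source, standard target) are the kind of input a sequence-to-sequence standardization model could be trained on. Setting `p` below 1.0 would yield partially colloquial sentences, which may better reflect mixed-register web text, though that is an assumption of this sketch rather than a claim from the paper.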