Paper title
BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages
Paper authors
Paper abstract
Morphologically rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy for handling this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. We then compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that, for all language pairs except Nahuatl, an unsupervised morphological segmentation algorithm consistently outperforms BPEs and that, although supervised methods achieve better segmentation scores, they underperform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri-Spanish.
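For readers unfamiliar with the BPE baseline named in the abstract, the following is a minimal sketch of how BPE merges are typically learned and applied. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the toy English word counts, and the tie-breaking rule are all assumptions made for illustration, and the usual end-of-word marker is omitted for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> frequency mapping."""
    # Start from character-level symbols for every word.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption, chosen only to make the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = merge_pair(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy counts, not data from the paper.
toy_counts = {"low": 5, "lower": 2, "lowest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=2)
print(segment("lowest", toy_merges))
```

Because the merges are driven purely by pair frequency, the resulting subwords need not align with morpheme boundaries, which is the contrast with morphological segmentation that the abstract draws.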