Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms

Paper title

Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms

Authors

Wonkee Lee, Seong-Hwan Heo, Jong-Hyeok Lee

Abstract

Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
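The two ideas named in the abstract, noising a clean text by infilling masked tokens with erroneous ones and interleaving two synthetic corpora sample by sample, can be illustrated with a toy sketch. This is not the authors' implementation. The `confusions` table is a hypothetical stand-in for a masked language model that predicts erroneous tokens, and the `score` function used to decide which sample is "advantageous" is likewise an assumption, since the abstract does not state the selection criterion.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, mask_rate, rng):
    """Replace a random subset of tokens with MASK; return the masked list and positions."""
    n = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n))
    chosen = set(positions)
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    return masked, positions


def infill_with_errors(masked, original, positions, confusions, rng):
    """Fill each masked slot with a wrong token for the word that was there.

    `confusions` (hypothetical) maps a correct token to plausible erroneous
    replacements; a real system would query a trained model instead.
    """
    noisy = list(masked)
    for i in positions:
        candidates = [c for c in confusions.get(original[i], []) if c != original[i]]
        # Fall back to the original token when no erroneous candidate is known.
        noisy[i] = rng.choice(candidates) if candidates else original[i]
    return noisy


def synthesize(clean_tokens, confusions, mask_rate=0.3, seed=0):
    """Turn a clean sentence into a noisy pseudo machine-translation output."""
    rng = random.Random(seed)
    masked, positions = mask_tokens(clean_tokens, mask_rate, rng)
    return infill_with_errors(masked, clean_tokens, positions, confusions, rng)


def selective_interleave(corpus_a, corpus_b, score):
    """For each aligned pair of samples, keep the one with the higher score.

    `score` is an assumed quality function; ties go to corpus_a.
    """
    return [a if score(a) >= score(b) else b for a, b in zip(corpus_a, corpus_b)]


if __name__ == "__main__":
    clean = "the cat sat on the mat".split()
    confusions = {"the": ["a"], "cat": ["cats"], "sat": ["sit"],
                  "on": ["in"], "mat": ["mats"]}
    print(synthesize(clean, confusions, mask_rate=0.5, seed=1))
```

In an APE setting the clean text would play the role of the post-edited target and the noisy output the role of the machine translation, so each call yields one synthetic training pair.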
