Semantically Consistent Data Augmentation for Neural Machine Translation via Conditional Masked Language Model

Paper title

Semantically Consistent Data Augmentation for Neural Machine Translation via Conditional Masked Language Model

Authors

Cheng, Qiao; Huang, Jin; Duan, Yitao

Abstract

This paper introduces a new data augmentation method for neural machine translation that can enforce stronger semantic consistency both within and across languages. Our method is based on a Conditional Masked Language Model (CMLM), which is bidirectional and can be conditioned on both left and right context as well as the label. We demonstrate that CMLM is a good technique for generating context-dependent word distributions. In particular, we show that CMLM is capable of enforcing semantic consistency by conditioning on both source and target during substitution. In addition, to enhance diversity, we incorporate the idea of soft word substitution for data augmentation, which replaces a word with a probability distribution over the vocabulary. Experiments on four translation datasets of different scales show that the overall solution results in more realistic data augmentation and better translation quality. Our approach consistently achieves the best performance in comparison with strong and recent works and yields improvements of up to 1.90 BLEU points over the baseline.
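
The soft word substitution mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common reading of the idea: with some probability, a token's embedding is replaced by the expected embedding under a predicted distribution over the vocabulary. The function names, the `gamma` parameter, and the `predict_dist` interface (a stand-in for a CMLM) are hypothetical and not taken from the paper.

```python
import random


def soft_substitute(token_ids, embeddings, predict_dist, gamma=0.15, rng=None):
    """Return one embedding vector per token; some positions are softened.

    token_ids    : list of vocabulary indices for one sentence
    embeddings   : list of rows, one embedding vector per vocabulary entry
    predict_dist : callable(token_ids, position) -> probabilities over the
                   vocabulary (hypothetical stand-in for a CMLM)
    gamma        : probability of softening each position (assumed value)
    """
    rng = rng or random.Random(0)
    dim = len(embeddings[0])
    output = []
    for pos, tok in enumerate(token_ids):
        if rng.random() < gamma:
            probs = predict_dist(token_ids, pos)
            # Expected embedding under the predicted distribution.
            vec = [
                sum(p * embeddings[v][d] for v, p in enumerate(probs))
                for d in range(dim)
            ]
        else:
            # Keep the ordinary (hard) embedding of the original token.
            vec = list(embeddings[tok])
        output.append(vec)
    return output


def uniform_dist(token_ids, position):
    """Toy predictor: uniform over a 3-word vocabulary (illustration only)."""
    return [1.0 / 3.0] * 3


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
all_soft = soft_substitute([0, 1, 2], toy_embeddings, uniform_dist, gamma=1.0)
all_hard = soft_substitute([0, 1, 2], toy_embeddings, uniform_dist, gamma=0.0)
```

In this toy setup, `gamma=1.0` softens every position and `gamma=0.0` leaves the original embeddings untouched. A real system would replace `uniform_dist` with a model that conditions on both the source and target sentences, as the abstract describes.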

Paper title