Paper Title

An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution

Paper Authors

Ryuto Konno, Yuichiroh Matsubayashi, Shun Kiyono, Hiroki Ouchi, Ryo Takahashi, Kentaro Inui

Paper Abstract

One critical issue of zero anaphora resolution (ZAR) is the scarcity of labeled data. This study explores how effectively this problem can be alleviated by data augmentation. We adopt a state-of-the-art data augmentation method, called contextual data augmentation (CDA), which generates labeled training instances using a pretrained language model. CDA has been reported to work well for several other natural language processing tasks, including text classification and machine translation. This study addresses two underexplored issues of CDA: how to reduce the computational cost of data augmentation and how to ensure the quality of the generated data. We also propose two methods to adapt CDA to ZAR: [MASK]-based augmentation and linguistically-controlled masking. The experimental results on Japanese ZAR show that our methods contribute to both accuracy gains and computational cost reduction. Our closer analysis reveals that the proposed methods can improve the quality of the augmented training data compared to conventional CDA.
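To make the CDA recipe summarized above concrete, the sketch below shows the generic idea: mask some tokens of a labeled sentence and let a pretrained masked language model propose in-context replacements, while the original labels are carried over to the augmented copy. This is only a minimal illustration, not the paper's implementation; the model name (bert-base-multilingual-cased), the masking probability, and the top-1 replacement policy are assumptions, and the paper's specific contributions ([MASK]-based augmentation and linguistically-controlled masking) are not reproduced here.

```python
# Minimal sketch of contextual data augmentation with a masked LM.
# Assumptions: model choice, mask_prob, and top-1 replacement are illustrative only.
import random
from transformers import pipeline

# "bert-base-multilingual-cased" is just an example; a Japanese-specific
# masked LM would be a more natural choice for Japanese ZAR.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
mask_token = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style models


def augment(tokens, mask_prob=0.15):
    """Mask random tokens and replace each with the LM's top in-context prediction.

    The anaphora labels attached to the original sentence are assumed to be
    copied unchanged to the augmented sentence (the core idea of CDA).
    """
    new_tokens = list(tokens)
    for i in range(len(new_tokens)):
        if random.random() < mask_prob:
            masked = list(new_tokens)
            masked[i] = mask_token
            # The pipeline returns ranked candidates for the single [MASK] slot.
            candidates = fill_mask(" ".join(masked), top_k=1)
            new_tokens[i] = candidates[0]["token_str"].strip()
    return new_tokens


if __name__ == "__main__":
    print(augment(["The", "student", "read", "the", "book", "."]))
```

In this toy version every token is equally likely to be masked; the paper instead controls which tokens are masked (linguistically-controlled masking) and how the [MASK] symbol itself is used, which is where the reported accuracy and cost improvements come from.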
