Paper Title
CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding
Paper Authors
Paper Abstract
Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging. In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization objective is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% when applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training baselines, including in low-resource settings. Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework.
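The abstract only names the components of the contrastive objective; the sketch below illustrates how an InfoNCE-style contrastive loss with a momentum encoder and a memory bank is commonly implemented (in the style popularized by MoCo), not the paper's exact formulation. The function names, the temperature of 0.07, and the EMA coefficient of 0.999 are illustrative assumptions, not values taken from CoDA.

```python
import torch
import torch.nn.functional as F

def momentum_update(encoder_q, encoder_k, m=0.999):
    # EMA update of the momentum (key) encoder from the query encoder.
    # The coefficient m=0.999 is a common default, assumed here.
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def contrastive_loss(q, k_pos, memory_bank, temperature=0.07):
    """InfoNCE-style loss.

    q           -- query embeddings of the original examples, shape (B, D)
    k_pos       -- momentum-encoder embeddings of their augmented views, (B, D)
    memory_bank -- embeddings of past examples serving as negatives, (K, D)
    """
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    neg = F.normalize(memory_bank, dim=1)

    l_pos = torch.sum(q * k_pos, dim=1, keepdim=True)  # (B, 1) positive logits
    l_neg = q @ neg.t()                                # (B, K) negative logits

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive pair always sits at index 0 of each row.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

Per the abstract's description, the query embeddings would come from the task model's encoding of an original example, the keys from the momentum encoder's encoding of its augmented counterpart, and the memory bank would supply negatives beyond the current batch so the loss captures relationships among all data samples.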