Paper Title
A Data Cartography based MixUp for Pre-trained Language Models
Paper Authors
Paper Abstract
MixUp is a data augmentation strategy in which additional samples are generated during training by combining random pairs of training samples and their labels. However, selecting random pairs is not necessarily the optimal choice. In this work, we propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples. Our proposed TDMixUp first measures confidence and variability (Swayamdipta et al., 2020), and Area Under the Margin (AUM) (Pleiss et al., 2020) to identify the characteristics of training samples (e.g., easy-to-learn or ambiguous samples), and then interpolates these characterized samples. We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error for the pre-trained language model BERT, in both in-domain and out-of-domain settings across a wide range of NLP tasks. We publicly release our code.
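The sketch below illustrates, under stated assumptions, how the quantities named in the abstract could drive MixUp pairing: it assumes per-epoch gold-class probabilities and margins are already available as NumPy arrays, and the selection thresholds and pairing rule (easy-to-learn paired with ambiguous) are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch: training-dynamics statistics and a MixUp interpolation step.
# Assumes `probs_true` and `margins` are arrays of shape (num_epochs, num_samples).
import numpy as np


def training_dynamics_stats(probs_true, margins):
    """Confidence/variability (Swayamdipta et al., 2020) and AUM (Pleiss et al., 2020)."""
    confidence = probs_true.mean(axis=0)   # mean gold-class probability across epochs
    variability = probs_true.std(axis=0)   # std of gold-class probability across epochs
    aum = margins.mean(axis=0)             # average logit margin across epochs
    return confidence, variability, aum


def select_pairs(confidence, variability, top_k=100):
    """Illustrative pairing: easy-to-learn (high confidence, low variability)
    samples matched with ambiguous (high variability) samples."""
    easy = np.argsort(-(confidence - variability))[:top_k]
    ambiguous = np.argsort(-variability)[:top_k]
    return list(zip(easy, ambiguous))


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=np.random):
    """Standard MixUp: interpolate inputs (e.g., sentence embeddings) and one-hot labels."""
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x_a + (1.0 - lam) * x_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix
```

In practice the interpolation would be applied to intermediate BERT representations of the selected pairs during fine-tuning; the function and variable names above are hypothetical and used only to make the described pipeline concrete.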