Paper Title
SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training
Paper Authors
Paper Abstract
The conventional success of textual classification relies on annotated data, and the new paradigm of pre-trained language models (PLMs) still requires some labeled data for downstream tasks. However, in real-world applications, label noise inevitably exists in training data, damaging the effectiveness, robustness, and generalization of models built on such data. Recently, remarkable progress has been made in mitigating this problem for visual data, while only a few studies have explored textual data. To fill this gap, we present SelfMix, a simple yet effective method for handling label noise in text classification tasks. SelfMix uses a Gaussian Mixture Model to separate samples and leverages semi-supervised learning. Unlike previous works that require multiple models, our method utilizes the dropout mechanism on a single model to reduce confirmation bias in self-training and introduces a textual-level mixup training strategy. Experimental results on three text classification benchmarks with different types of text show that our proposed method outperforms strong baselines designed for both textual and visual data under different noise ratios and noise types. Our code is available at https://github.com/noise-learning/SelfMix.
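Since the abstract only sketches the method, below is a minimal, hypothetical Python sketch of the two components it names: fitting a two-component Gaussian Mixture Model on per-sample losses to separate (presumed) clean from noisy samples, and a mixup step applied at the level of encoded text representations. The function names, threshold, and `alpha` parameter are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch, assuming per-sample losses have already been computed with a
# text classifier; not the authors' code.
import numpy as np
from sklearn.mixture import GaussianMixture


def split_by_gmm(per_sample_losses, clean_threshold=0.5):
    """Fit a 2-component GMM on normalized losses; samples whose posterior
    probability of belonging to the low-mean (clean) component exceeds the
    threshold are treated as correctly labeled."""
    losses = np.asarray(per_sample_losses, dtype=np.float64).reshape(-1, 1)
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
    gmm = GaussianMixture(n_components=2, reg_covar=5e-4).fit(losses)
    clean_component = int(np.argmin(gmm.means_.ravel()))
    p_clean = gmm.predict_proba(losses)[:, clean_component]
    return p_clean > clean_threshold, p_clean


def mixup_representations(hidden_a, hidden_b, labels_a, labels_b, alpha=0.75):
    """Interpolate sentence-level representations and (soft) labels, as in
    mixup; applied to encoded text rather than raw tokens."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1 - lam)  # keep the mixed sample closer to the first input
    mixed_hidden = lam * hidden_a + (1 - lam) * hidden_b
    mixed_labels = lam * labels_a + (1 - lam) * labels_b
    return mixed_hidden, mixed_labels
```

In a training loop, one would split the data each epoch with `split_by_gmm`, use the noisy subset as unlabeled data for semi-supervised learning, and apply `mixup_representations` to pairs of encoded examples; the dropout-based pseudo-labeling on a single model mentioned in the abstract is not shown here.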