Paper Title
Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization
Paper Authors
Paper Abstract
Machine learning models can reach high performance on benchmark natural language processing (NLP) datasets but fail in more challenging settings. We study this issue when a pre-trained model learns dataset artifacts in natural language inference (NLI), the task of determining the logical relationship between a pair of text sequences. We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference (SNLI) corpus, and we study the stylistic patterns of these artifacts. To mitigate dataset artifacts, we employ a multi-scale data augmentation technique with two distinct frameworks: a behavioral testing checklist at the sentence level and lexical synonym criteria at the word level. Specifically, this combined approach improves our model's resistance to perturbation testing, enabling it to consistently outperform the pre-trained baseline.
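To make the word-level side of the multi-scale augmentation concrete, the following is a minimal Python sketch of synonym substitution over a training sentence using WordNet via NLTK. The function name synonym_augment, the replace_prob parameter, and the random-sampling strategy are illustrative assumptions for this sketch, not the paper's actual implementation.

# Minimal sketch (assumed, not the paper's code): replace a random subset of
# words in a sentence with WordNet synonyms to create a perturbed training example.
import random
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus

def synonym_augment(sentence: str, replace_prob: float = 0.2, seed: int = 0) -> str:
    """Return a copy of `sentence` with some words swapped for WordNet synonyms."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        # Collect candidate synonyms for this surface form, excluding the word itself.
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wn.synsets(word)
            for lemma in syn.lemmas()
            if lemma.name().lower() != word.lower()
        }
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(sorted(synonyms)))
        else:
            out.append(word)
    return " ".join(out)

# Example: generate a perturbed premise for an SNLI-style premise/hypothesis pair.
print(synonym_augment("A man is playing a guitar on the street"))

A sentence-level counterpart would apply CheckList-style behavioral perturbations (e.g., negation or named-entity swaps) to whole premises or hypotheses; the sketch above covers only the word-level path.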