Title

Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations

Authors

Chris Emmery, Ákos Kádár, Grzegorz Chrupała, Walter Daelemans

Abstract

A limited number of studies have investigated the role of model-agnostic adversarial behavior in toxic content classification. As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show that models become robust against word-level perturbations at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora. The perturbed data, models, and code are available for reproduction at https://github.com/cmry/augtox.
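To make the idea of a model-agnostic, word-level perturbation concrete, here is a minimal hypothetical sketch: toxic cue words are obfuscated with leetspeak-style character swaps, without querying any target model. This is only an illustration of the general attack class the abstract describes, not the substitution method used in the paper; the word list and mapping are invented for the example.

```python
# Hypothetical character-substitution table (not from the paper).
LEETSPEAK = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}


def perturb(text: str, targets: set[str]) -> str:
    """Obfuscate any word in `targets` via leetspeak character swaps,
    leaving all other words untouched."""
    out = []
    for word in text.split():
        if word.lower() in targets:
            word = "".join(LEETSPEAK.get(c.lower(), c) for c in word)
        out.append(word)
    return " ".join(out)


print(perturb("you are a loser", {"loser"}))  # → "you are a l053r"
```

A lexical classifier trained on the clean surface forms no longer sees the cue token "loser" in the perturbed sample, which is the failure mode such substitutions exploit; augmenting training data with perturbed samples is the corresponding defense.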
