Preserving Semantics in Textual Adversarial Attacks

Paper Title

Preserving Semantics in Textual Adversarial Attacks

Authors

David Herel, Hugo Cisneros, Tomas Mikolov

Abstract

The growth of hateful online content, or hate speech, has been associated with a global increase in violent crimes against minorities [23]. Harmful online content can be produced easily, automatically, and anonymously. Although some form of automatic detection is already achieved through text classifiers in NLP, these classifiers can be fooled by adversarial attacks. To strengthen existing systems and stay ahead of attackers, we need better adversarial attacks. In this paper, we show that up to 70% of the adversarial examples generated by adversarial attacks should be discarded because they do not preserve semantics. We address this core weakness and propose a new, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE). Our method outperforms existing sentence encoders used in adversarial attacks, achieving a 1.2x to 5.1x better real attack success rate. We release our code as a plugin that can be used in any existing adversarial attack to improve its quality and speed up its execution.
