Paper Title

Power of Explanations: Towards automatic debiasing in hate speech detection

Paper Authors

Yi Cai, Arthur Zimek, Gerhard Wunder, Eirini Ntoutsi

Paper Abstract

Hate speech detection is a common real-world downstream application of natural language processing (NLP). Despite increasing accuracy, current data-driven approaches can easily learn biases from imbalanced data distributions originating from humans, and deploying biased models can further amplify existing social biases. Unlike with tabular data, however, defining and mitigating biases in text classifiers, which handle unstructured data, is more challenging. A popular solution for improving machine learning fairness in NLP is to run the debiasing process with a list of potentially discriminatory words provided by human annotators. Besides the risk of overlooking biased terms, exhaustively identifying bias with human annotators is unsustainable, since discrimination varies across datasets and may evolve over time. To this end, we propose an automatic misuse detector (MiD) that relies on an explanation method to detect potential bias. Built upon it, an end-to-end debiasing framework with the proposed staged correction is designed for text classifiers, without requiring any external resources.
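
The abstract does not spell out how the explanation method surfaces biased terms, so the sketch below only illustrates the general idea rather than the authors' MiD implementation: a leave-one-out (occlusion) attribution assigns each token the score drop caused by removing it, and tokens that repeatedly receive high attribution in non-hateful texts are flagged as potentially misused. The toy classifier, token weights, corpus, and threshold here are all hypothetical.

```python
# Minimal sketch of explanation-based bias detection (not the paper's MiD).
# Idea: tokens the model leans on heavily even in benign texts are symptoms
# of learned bias rather than genuine abusiveness.
from collections import defaultdict

# Hypothetical learned token weights of a toy hate-speech scorer.
TOXIC_WEIGHTS = {"idiots": 2.0, "trash": 1.5, "muslim": 1.8, "women": 1.2}

def score(tokens):
    """Toy hate-speech score: sum of learned token weights."""
    return sum(TOXIC_WEIGHTS.get(t, 0.0) for t in tokens)

def occlusion_attributions(tokens):
    """Attribution of each token = score drop when that token is removed."""
    base = score(tokens)
    return {t: base - score([u for u in tokens if u != t]) for t in set(tokens)}

def flag_potential_bias(corpus, labels, threshold=1.0):
    """Flag tokens with high attribution in non-hateful texts: the model
    relies on them even when the text is benign."""
    suspicious = defaultdict(int)
    for tokens, label in zip(corpus, labels):
        if label == "hate":
            continue  # only benign texts can reveal misuse
        for tok, attr in occlusion_attributions(tokens).items():
            if attr >= threshold:
                suspicious[tok] += 1
    return sorted(suspicious, key=suspicious.get, reverse=True)

corpus = [
    ["those", "idiots", "are", "trash"],         # hateful
    ["muslim", "families", "celebrate", "eid"],  # benign
    ["women", "won", "the", "award"],            # benign
]
labels = ["hate", "benign", "benign"]

print(flag_potential_bias(corpus, labels))  # e.g. ['muslim', 'women']
```

In practice the toy scorer would be replaced by the trained text classifier and a real attribution method such as occlusion or integrated gradients; the key design choice the abstract describes is that candidate bias terms are mined from the model's own explanations instead of a human-curated word list.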
