Paper Title
Human-Guided Fair Classification for Natural Language Processing
Paper Authors
Paper Abstract
Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.
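To make the described pipeline more concrete, the following is a minimal Python/PyTorch sketch of two ingredients the abstract mentions: a zero-shot prompt for eliciting a candidate counterfactual sentence from a GPT-3-style model, and a consistency penalty that trains a classifier to behave similarly on a sentence and its validated counterfactual. This is an illustrative assumption, not the authors' implementation; the prompt wording, the function names (counterfactual_prompt, fairness_consistency_loss), the penalty form, and the weight lam are hypothetical choices.

```python
# Illustrative sketch only; not the paper's exact method.
import torch
import torch.nn.functional as F


def counterfactual_prompt(sentence: str, attribute: str) -> str:
    """Build a zero-shot prompt asking a large language model to rewrite a
    sentence so that only the given sensitive attribute changes (assumed
    prompt wording; the completion call itself is left to the caller)."""
    return (
        f"Rewrite the following sentence so that the {attribute} of the "
        f"person it refers to is changed, while keeping the meaning and "
        f"tone otherwise the same.\n\nSentence: {sentence}\nRewrite:"
    )


def fairness_consistency_loss(model, x, x_cf, y, lam: float = 1.0):
    """Task loss plus a penalty that pushes the classifier to produce
    similar output distributions on a sentence batch `x` and its paired
    counterfactual batch `x_cf` (both already encoded as model inputs)."""
    logits = model(x)          # predictions on original sentences
    logits_cf = model(x_cf)    # predictions on counterfactual sentences
    task = F.cross_entropy(logits, y)
    consistency = F.mse_loss(
        F.softmax(logits, dim=-1), F.softmax(logits_cf, dim=-1)
    )
    return task + lam * consistency
```

In such a setup, the crowdsourced validation and the learned similarity specification would decide which generated pairs are admitted as counterfactuals (x_cf) into the consistency term when training the downstream fairness-aware classifier.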