Paper Title

The Birth of Bias: A case study on the evolution of gender bias in an English language model

Authors

Oskar van der Wal, Jaap Jumelet, Katrin Schulz, Willem Zuidema

Abstract

Detecting and mitigating harmful biases in modern language models are widely recognized as crucial, open problems. In this paper, we take a step back and investigate how language models come to be biased in the first place. We use a relatively small language model with the LSTM architecture, trained on an English Wikipedia corpus. With full access to the data and to the model parameters as they change during every step of training, we can map in detail how the representation of gender develops, what patterns in the dataset drive this, and how the model's internal state relates to the bias in a downstream task (semantic textual similarity). We find that the representation of gender is dynamic and identify different phases during training. Furthermore, we show that gender information is represented increasingly locally in the input embeddings of the model and that, as a consequence, debiasing these can be effective in reducing the downstream bias. Monitoring the training dynamics allows us to detect an asymmetry in how the female and male gender are represented in the input embeddings. This is important, as it may cause naive mitigation strategies to introduce new undesirable biases. We discuss the relevance of the findings for mitigation strategies more generally, and the prospects of generalizing our methods to larger language models, the Transformer architecture, other languages, and other undesirable biases.
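
The following is a minimal sketch (not taken from the paper) of the kind of input-embedding debiasing the abstract alludes to: estimating a "gender direction" from differences of gendered word pairs and projecting it out of an embedding. All vectors, word choices, and the `debias` helper below are illustrative assumptions, not the paper's actual LSTM embeddings or procedure.

```python
import numpy as np

# Toy word embeddings; in the paper these would be the LSTM's learned
# input embeddings. Values here are made up for illustration only.
emb = {
    "he":     np.array([ 0.9, 0.1, 0.3]),
    "she":    np.array([-0.8, 0.2, 0.3]),
    "man":    np.array([ 0.7, 0.5, 0.1]),
    "woman":  np.array([-0.7, 0.6, 0.1]),
    "doctor": np.array([ 0.4, 0.8, 0.2]),
}

# Estimate a single gender direction from differences of gendered word pairs.
pairs = [("he", "she"), ("man", "woman")]
diffs = np.stack([emb[a] - emb[b] for a, b in pairs])
gender_dir = diffs.mean(axis=0)
gender_dir /= np.linalg.norm(gender_dir)

def debias(vec):
    """Remove the component of `vec` lying along the estimated gender direction."""
    return vec - np.dot(vec, gender_dir) * gender_dir

print("projection before:", np.dot(emb["doctor"], gender_dir))
print("projection after: ", np.dot(debias(emb["doctor"]), gender_dir))  # ~0.0
```

As the abstract cautions, such naive projection-based mitigation can itself introduce new undesirable biases when female and male gender are represented asymmetrically in the embeddings, which is one of the paper's motivations for monitoring training dynamics.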
