Paper title
Do self-supervised speech models develop human-like perception biases?
Paper authors
Paper abstract
Self-supervised models for speech processing form representational spaces without using any external labels. Increasingly, they appear to be a feasible way of at least partially eliminating costly manual annotations, a problem of particular concern for low-resource languages. But what kind of representational spaces do these models construct? Human perception specializes to the sounds of listeners' native languages. Does the same thing happen in self-supervised models? We examine the representational spaces of three kinds of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT, and contrastive predictive coding (CPC), and compare them with the perceptual spaces of French-speaking and English-speaking human listeners, both globally and taking account of the behavioural differences between the two language groups. We show that the CPC model exhibits a small native language effect, but that wav2vec 2.0 and HuBERT seem to develop a universal speech perception space which is not language specific. A comparison against the predictions of supervised phone recognisers suggests that all three self-supervised models capture relatively fine-grained perceptual phenomena, while supervised models are better at capturing coarser, phone-level effects of listeners' native language on perception.