Paper Title
GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models
Paper Authors
Paper Abstract
Recent work has shown that Pre-trained Language Models (PLMs) store relational knowledge learned from data and utilize it to perform downstream tasks. However, commonsense knowledge can vary across regions: for instance, the color of the bridal dress is white at American weddings, whereas it is red at Chinese weddings. In this paper, we introduce a benchmark dataset, Geo-Diverse Commonsense Multilingual Language Models Analysis (GeoMLAMA), for probing the diversity of relational knowledge in multilingual PLMs. GeoMLAMA contains 3,125 prompts in English, Chinese, Hindi, Persian, and Swahili, with wide coverage of concepts shared by people from American, Chinese, Indian, Iranian, and Kenyan cultures. We benchmark 11 standard multilingual PLMs on GeoMLAMA. Interestingly, we find that 1) larger variants of multilingual PLMs do not necessarily store geo-diverse concepts better than their smaller counterparts; 2) multilingual PLMs are not intrinsically biased towards knowledge from Western countries (the United States); 3) the native language of a country may not be the best language to probe its knowledge; and 4) a language may probe knowledge about a non-native country better than about its native country. Code and data are released at https://github.com/WadeYin9712/GeoMLAMA.
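To make the probing setup concrete, below is a minimal sketch (in Python, using the HuggingFace transformers library) of mask-based candidate ranking in the spirit of what the abstract describes. It is not the authors' released code (see the repository linked above); the model choice, the example prompt, and the candidate answer set are all illustrative assumptions.

# Minimal sketch of GeoMLAMA-style probing: ask a multilingual masked LM a
# geo-diverse question and rank a small set of candidate answers by the
# probability the model assigns at the [MASK] position. The model, prompt,
# and candidates are illustrative assumptions, not taken from the dataset.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

prompt = f"The color of the bridal dress at a traditional Chinese wedding is {tokenizer.mask_token}."
candidates = ["red", "white", "black", "green"]  # hypothetical, single-token answers assumed

inputs = tokenizer(prompt, return_tensors="pt")
# Locate the [MASK] position in the tokenized prompt.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)
probs = logits[0, mask_pos].softmax(dim=-1)  # distribution over the vocabulary at [MASK]

# Score each candidate by its probability at the masked slot and pick the best.
scores = {c: probs[0, tokenizer.convert_tokens_to_ids(c)].item() for c in candidates}
print(max(scores, key=scores.get), scores)

Repeating this ranking with the same concept phrased in English, Chinese, Hindi, Persian, and Swahili is what allows the benchmark to compare how well each probing language recovers each country's commonsense knowledge.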