Paper Title
Considerations for Multilingual Wikipedia Research
Paper Authors
Paper Abstract
English Wikipedia has long been an important data source for much research and natural language machine learning modeling. The growth of non-English language editions of Wikipedia, greater computational resources, and calls for equity in the performance of language and multimodal models have led to the inclusion of many more language editions of Wikipedia in datasets and models. Building better multilingual and multimodal models requires more than just access to expanded datasets; it also requires a better understanding of what is in the data and how this content was generated. This paper seeks to provide some background to help researchers think about what differences might arise between different language editions of Wikipedia and how that might affect their models. It details three major ways in which content differences between language editions arise (local context, community and governance, and technology) and recommendations for good practices when using multilingual and multimodal data for research and modeling.