Paper Title


Crosslingual Topic Modeling with WikiPDA

Authors

Piccardi, Tiziano, West, Robert

Abstract


We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning. Researchers can benefit from WikiPDA as a practical tool for studying Wikipedia's content across its 299 language editions in interpretable ways, via an easy-to-use library publicly available at https://github.com/epfl-dlab/WikiPDA.
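The key idea, as the abstract describes, is that once an article is represented as a bag of the Wikidata concepts (QIDs) it links to, the representation no longer depends on the article's language. A minimal sketch of that representation, using illustrative article names and QIDs (not real Wikipedia link data):

```python
# Toy sketch of WikiPDA's language-independent "bag of links" representation.
# Article names and QID link lists below are illustrative, not real data.
from collections import Counter

# Each article, in any language edition, is reduced to the Wikidata QIDs
# of the concepts it links to.
articles = {
    "en:Zurich": ["Q39", "Q11943", "Q39", "Q684"],
    "fr:Zurich": ["Q39", "Q11943", "Q684"],   # same concepts, different language
    "de:Bern":   ["Q39", "Q70", "Q39"],
}

# Shared concept vocabulary across all language editions.
vocab = sorted({qid for links in articles.values() for qid in links})

def bag_of_links_vector(links, vocab):
    """Count vector over the shared concept vocabulary."""
    counts = Counter(links)
    return [counts[q] for q in vocab]

# These sparse count vectors are what WikiPDA densifies via matrix
# completion before training a standard monolingual topic model on them.
vectors = {name: bag_of_links_vector(links, vocab)
           for name, links in articles.items()}
```

Because the English and French articles about the same entity link to overlapping sets of concepts, their vectors are directly comparable without any translation step, which is what makes training a single topic model over all editions possible.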
