Paper Title

Towards Fully Bilingual Deep Language Modeling

Paper Authors

Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

Paper Abstract

Language models based on deep neural networks have facilitated great advances in natural language processing and understanding tasks in recent years. While models covering a large number of languages have been introduced, their multilinguality has come at a cost in terms of monolingual performance, and the best-performing models at most tasks not involving cross-lingual transfer remain monolingual. In this paper, we consider the question of whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language. We collect pre-training data, create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models. Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks, clearly outperforming multilingual BERT. We find that when the model vocabulary size is increased, the BERT-Base architecture has sufficient capacity to learn two remotely related languages to a level where it achieves comparable performance with monolingual models, demonstrating the feasibility of training fully bilingual deep language models. The model and all tools involved in its creation are freely available at https://github.com/TurkuNLP/biBERT
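
As a rough illustration of the abstract's key architectural point (a standard BERT-Base model whose capacity suffices for two languages once the vocabulary is enlarged), the sketch below instantiates a BERT-Base configuration with a larger subword vocabulary. This is not the authors' training code (that is available in the linked repository); the use of the Hugging Face transformers library and the specific vocab_size value are assumptions for illustration only.

```python
# Minimal sketch: a BERT-Base configuration whose only structural change is an
# enlarged vocabulary intended to cover subwords of two languages.
# NOTE: vocab_size=80_000 is a hypothetical placeholder, not the paper's value.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=80_000,        # enlarged bilingual subword vocabulary (placeholder)
    hidden_size=768,          # standard BERT-Base dimensions, unchanged
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)

model = BertForMaskedLM(config)
# Most of the added parameters sit in the embedding matrix (vocab_size x hidden_size);
# the Transformer layers themselves are identical to monolingual BERT-Base.
print(f"Total parameters: {model.num_parameters():,}")
```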
