Paper Title

When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

Paper Authors

Benjamin Muller, Antonis Anastasopoulos, Benoît Sagot, Djamé Seddah

Paper Abstract

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. Transliterating those languages improves very significantly the ability of large-scale multilingual language models on downstream tasks.

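The recipe described in the abstract (transliterate an unseen language into the Latin script, then reuse a large multilingual model such as mBERT) can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the partial romanization table, the placeholder sample string, and the use of the Hugging Face transformers tokenizer for bert-base-multilingual-cased are assumptions made only to show the before/after effect of transliteration on tokenization.

# Minimal sketch (not the authors' code): transliterate text into the Latin
# script before tokenizing with mBERT, then compare segmentation statistics.
# The romanization table and the sample string are hypothetical placeholders
# (a handful of Perso-Arabic characters, not real Uyghur); a real pipeline
# would rely on a proper transliteration tool.

from transformers import AutoTokenizer

# Hypothetical, partial Perso-Arabic -> Latin mapping (illustration only).
ROMANIZE = {
    "ا": "a", "ب": "b", "ت": "t", "د": "d", "ر": "r", "س": "s",
    "ل": "l", "م": "m", "ن": "n", "ك": "k", "ي": "i", "و": "u",
}

def transliterate(text: str) -> str:
    """Map each character through the romanization table; keep anything unmapped."""
    return "".join(ROMANIZE.get(ch, ch) for ch in text)

def fertility(tokenizer, text: str) -> float:
    """Average number of subword pieces per whitespace-separated token."""
    words = text.split()
    return len(tokenizer.tokenize(text)) / max(len(words), 1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
sample = "سلام دنيا"  # placeholder string in Perso-Arabic script

for label, text in (("original", sample), ("transliterated", transliterate(sample))):
    pieces = tokenizer.tokenize(text)
    unk_rate = pieces.count(tokenizer.unk_token) / max(len(pieces), 1)
    print(f"{label:>14}: pieces={pieces} "
          f"fertility={fertility(tokenizer, text):.2f} unk_rate={unk_rate:.2f}")

Lower fertility (fewer wordpieces per word) and a lower unknown-token rate after transliteration would indicate that the Latin-script version falls better within the model's wordpiece vocabulary, which is the mechanism the paper identifies for its downstream gains.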