Paper Title
Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models
Paper Authors
Paper Abstract
Character-based Neural Network Language Models (NNLMs) have the advantage of a smaller vocabulary and thus faster training times compared to NNLMs based on multi-character units. However, in low-resource scenarios, both character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-character NNLM performance by allowing information transfer from a source to the target language. In the same vein, we propose to use cross-lingual transfer for character NNLMs applied to low-resource Automatic Speech Recognition (ASR). However, applying cross-lingual transfer to character NNLMs is not as straightforward. We observe that the relatedness of the source language plays an important role in cross-lingual pretraining of character NNLMs. We evaluate this aspect on ASR tasks for two target languages: Finnish (with English and Estonian as sources) and Swedish (with Danish, Norwegian, and English as sources). Prior work has observed no difference between using related or unrelated languages for multi-character NNLMs. We, however, show that for character-based NNLMs, only pretraining with a related language improves ASR performance, and using an unrelated language may deteriorate it. We also observe that the benefits are larger when there is far less target data than source data.
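The core idea above — pretrain a character-level model on source-language text, then adapt it with scarce target-language data — can be illustrated with a minimal sketch. This is not the paper's NNLM setup: as a stand-in for weight initialization from a pretrained network, it interpolates character-bigram counts from a source corpus into a target model, and all corpora, weights, and smoothing constants here are toy assumptions.

```python
# Toy sketch of cross-lingual transfer at the character level, using a
# count-based bigram model instead of an NNLM (illustrative assumption).
import math
from collections import Counter

def bigram_counts(text):
    """Character-bigram counts, with '^' as a start-of-text symbol."""
    counts = Counter()
    prev = "^"
    for ch in text:
        counts[(prev, ch)] += 1
        prev = ch
    return counts

def transfer(source_counts, target_counts, weight=0.3):
    """Blend down-weighted source statistics into the target model --
    a crude stand-in for initializing a target NNLM from a
    source-pretrained one before fine-tuning."""
    merged = Counter()
    for k, v in source_counts.items():
        merged[k] += weight * v
    for k, v in target_counts.items():
        merged[k] += v
    return merged

def avg_nll(counts, text, alpha=1.0, vocab_size=30):
    """Add-alpha smoothed average negative log-likelihood per character
    (lower is better; a proxy for perplexity)."""
    prev_totals = Counter()
    for (p, _), v in counts.items():
        prev_totals[p] += v
    nll, prev = 0.0, "^"
    for ch in text:
        prob = (counts[(prev, ch)] + alpha) / (prev_totals[prev] + alpha * vocab_size)
        nll -= math.log(prob)
        prev = ch
    return nll / len(text)

# Toy corpora: the source shares character patterns with the target
# (mimicking a related language); the target corpus is tiny.
source = "talo kala pala sala " * 20   # plentiful related-language data (toy)
target = "talo vene kala "            # very little target data (toy)
held_out = "kala talo "

baseline = bigram_counts(target)
pretrained = transfer(bigram_counts(source), bigram_counts(target))
print(avg_nll(baseline, held_out) > avg_nll(pretrained, held_out))  # True
```

Because the source shares character statistics with the target, the transferred model assigns held-out target text a lower negative log-likelihood than the target-only baseline; with an unrelated source (mismatched character patterns), the same interpolation can raise it, mirroring the abstract's finding.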