Paper Title

A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language Models

Paper Authors

Vésteinn Snæbjarnarson, Haukur Barri Símonarson, Pétur Orri Ragnarsson, Svanhvít Lilja Ingólfsdóttir, Haukur Páll Jónsson, Vilhjálmur Þorsteinsson, Hafsteinn Einarsson

Paper Abstract

We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high-quality texts found online by targeting the Icelandic top-level domain (TLD). Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we translate and adapt the WinoGrande dataset for co-reference resolution. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.
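The abstract does not include the corpus-building pipeline itself. As a rough illustration of the TLD-targeting idea it describes, here is a minimal sketch, not the authors' actual method: it keeps only pages whose hostname falls under the Icelandic .is top-level domain and applies a crude quality filter. The record format (URL/text pairs, as one might extract from Common Crawl WET files) and all thresholds are assumptions for illustration.

```python
# Minimal sketch of TLD-targeted filtering (NOT the IC3 pipeline itself).
# Record format and cleaning thresholds are hypothetical assumptions.
from urllib.parse import urlparse

def is_icelandic_tld(url: str) -> bool:
    """Keep only pages whose hostname ends in the .is top-level domain."""
    host = urlparse(url).hostname or ""
    return host == "is" or host.endswith(".is")

def looks_clean(text: str, min_chars: int = 200) -> bool:
    """Crude quality heuristic (hypothetical): drop very short or
    fragment-dominated pages such as menus and link lists."""
    if len(text) < min_chars:
        return False
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Require that at least half of the non-empty lines are sentence-length.
    return sum(len(ln) > 40 for ln in lines) >= len(lines) / 2

def filter_records(records):
    """records: iterable of (url, extracted_text) pairs."""
    for url, text in records:
        if is_icelandic_tld(url) and looks_clean(text):
            yield url, text

if __name__ == "__main__":
    sample = [
        ("https://example.is/frett", "Löng og samfelld íslensk málsgrein. " * 10),
        ("https://example.com/page", "English page on the wrong TLD. " * 20),
        ("https://stutt.is/", "of stutt"),  # too short, filtered out
    ]
    for url, _ in filter_records(sample):
        print("kept:", url)
```

In practice a pipeline like this would run over Common Crawl archives and add deduplication and language identification on top of the TLD filter; the sketch only shows the domain-targeting step the abstract highlights.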
