Paper Title
On the comparability of Pre-trained Language Models
Paper Authors
Paper Abstract
Recent developments in unsupervised representation learning have successfully established the concept of transfer learning in NLP. Three forces are mainly driving the improvements in this area of research: More elaborate architectures make better use of contextual information. Instead of simply plugging in static pre-trained representations, representations are learned from the surrounding context in end-to-end trainable models with more intelligently designed language modelling objectives. Along with this, larger corpora are used as resources for pre-training large language models in a self-supervised fashion, which are afterwards fine-tuned on supervised tasks. Advances in parallel computing as well as in cloud computing have made it possible to train models of growing capacity in the same or even shorter time than previously established models. These three developments combine to yield new state-of-the-art (SOTA) results at an ever-increasing frequency. It is not always obvious where these improvements originate, as it is not possible to completely disentangle the contributions of the three driving forces. We set out to provide a clear and concise overview of several large pre-trained language models that achieved SOTA results in the last two years, with respect to their use of new architectures and resources. We want to clarify for the reader where the differences between the models lie, and we furthermore attempt to gain some insight into the individual contributions of lexical/computational improvements as well as of architectural changes. We explicitly do not intend to quantify these contributions, but rather see our work as an overview that identifies potential starting points for benchmark comparisons. Furthermore, we tentatively want to point out potential opportunities for improvement in the field of open-sourcing and reproducible research.
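As a concrete illustration of the pre-train/fine-tune paradigm the abstract refers to (self-supervised pre-training on a large corpus, followed by supervised fine-tuning), the following is a minimal sketch using the Hugging Face transformers and datasets libraries. It is not taken from the paper; the checkpoint name (bert-base-uncased), the downstream task (SST-2 sentiment classification), and all training settings are placeholder assumptions chosen only to make the workflow concrete.

```python
# Illustrative sketch only (not from the paper): load a model that was pre-trained
# in a self-supervised fashion, then fine-tune it end-to-end on a supervised task.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# 1) Pre-trained language model checkpoint (placeholder; any checkpoint works here).
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2) Supervised downstream task: SST-2 sentiment classification (assumed example).
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length"),
    batched=True,
)

# 3) Fine-tune all model parameters end-to-end, rather than plugging in
#    static pre-trained representations as fixed features.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```

The same two-stage recipe underlies the models the paper surveys; they differ in the architecture, the pre-training objective, the size of the pre-training corpus, and the compute budget, which is exactly the entanglement the overview tries to lay out.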