Paper Title
GLM-130B: An Open Bilingual Pre-trained Model
Paper Authors
Paper Abstract
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at \url{https://github.com/THUDM/GLM-130B/}.
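The abstract's INT4 claim refers to weight-only quantization applied directly to the trained weights, with no post-training calibration pass. The sketch below is a minimal illustration of one common scheme (symmetric, per-output-channel absmax quantization in PyTorch); it is an assumption for illustration only and not the actual GLM-130B quantization code, whose kernels and packing details differ.

    # Illustrative sketch: symmetric per-output-channel INT4 weight-only quantization.
    # Not the GLM-130B implementation; real deployments pack two 4-bit values per byte
    # and fuse dequantization into the matrix multiply.
    import torch

    def quantize_int4(weight: torch.Tensor):
        """weight: [out_features, in_features] matrix of trained weights."""
        qmax = 7  # use the symmetric part [-7, 7] of the signed INT4 range [-8, 7]
        scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
        q = torch.clamp(torch.round(weight / scale), -qmax, qmax).to(torch.int8)
        return q, scale  # INT4 values stored here in int8 containers for simplicity

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(scale.dtype) * scale

    # Usage: quantize a random weight matrix and check the reconstruction error.
    w = torch.randn(1024, 1024)
    q, s = quantize_int4(w)
    print((w - dequantize(q, s)).abs().mean())

Per-channel scales keep the rounding error proportional to each output row's magnitude, which is one reason weight-only INT4 can be applied to a trained model without calibration data when its weight distributions are well behaved, as the paper argues is the case for GLM-130B.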