Paper Title

TrimBERT: Tailoring BERT for Trade-offs

Paper Authors

Sharath Nittur Sridhar, Anthony Sarah, Sairam Sundaresan

Paper Abstract

Models based on BERT have been extremely successful in solving a variety of natural language processing (NLP) tasks. Unfortunately, many of these large models require a great deal of computational resources and/or time for pre-training and fine-tuning, which limits wider adoption. While self-attention layers have been well-studied, a strong justification for inclusion of the intermediate layers which follow them remains missing in the literature. In this work, we show that reducing the number of intermediate layers in BERT-Base results in minimal fine-tuning accuracy loss on downstream tasks while significantly decreasing model size and training time. We further mitigate two key bottlenecks by replacing all softmax operations in the self-attention layers with a computationally simpler alternative and removing half of all layernorm operations. This further decreases the training time while maintaining a high level of fine-tuning accuracy.
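
To make the modifications described in the abstract concrete, below is a minimal PyTorch sketch of an encoder layer in which the intermediate (feed-forward) sub-block is dropped, one of the two layernorms is removed, and the attention softmax is swapped for a cheaper normalization. The class names (SimplifiedSelfAttention, TrimmedEncoderLayer) and the division-by-sequence-length stand-in for softmax are assumptions for illustration only; they are not the paper's released implementation or its exact choice of softmax replacement.

```python
import torch
import torch.nn as nn


class SimplifiedSelfAttention(nn.Module):
    """Multi-head self-attention with the softmax over attention scores
    replaced by a cheaper normalization (here, division by the sequence
    length -- an illustrative stand-in, not the paper's exact choice)."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the hidden dimension into heads: (batch, heads, tokens, head_dim).
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Softmax-free weighting: scale raw scores instead of exponentiating.
        weights = scores / t
        context = (weights @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(context)


class TrimmedEncoderLayer(nn.Module):
    """Encoder layer without the intermediate (feed-forward) sub-block and
    with a single layernorm instead of the usual two."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.attn = SimplifiedSelfAttention(hidden_size, num_heads)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Residual connection around attention, followed by one layernorm.
        return self.norm(x + self.attn(x))


if __name__ == "__main__":
    layer = TrimmedEncoderLayer()
    tokens = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
    print(layer(tokens).shape)         # torch.Size([2, 16, 768])
```

In the full model described by the abstract, only some of the encoder layers would have their intermediate sub-blocks and layernorms removed; this sketch shows a single trimmed layer to illustrate the structure of the change.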
