Paper Title

TrimBERT: Tailoring BERT for Trade-offs

Paper Authors

Sharath Nittur Sridhar, Anthony Sarah, Sairam Sundaresan

Paper Abstract

Models based on BERT have been extremely successful in solving a variety of natural language processing (NLP) tasks. Unfortunately, many of these large models require a great deal of computational resources and/or time for pre-training and fine-tuning, which limits wider adoption. While self-attention layers have been well-studied, a strong justification for inclusion of the intermediate layers which follow them remains missing in the literature. In this work, we show that reducing the number of intermediate layers in BERT-Base results in minimal fine-tuning accuracy loss on downstream tasks while significantly decreasing model size and training time. We further mitigate two key bottlenecks by replacing all softmax operations in the self-attention layers with a computationally simpler alternative and removing half of all layernorm operations. This further decreases the training time while maintaining a high level of fine-tuning accuracy.
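
To make the modifications described in the abstract concrete, below is a minimal PyTorch sketch of an encoder layer in which the intermediate (feed-forward) sub-block is dropped, one of the two layernorms is removed, and the attention softmax is swapped for a cheaper normalization. The class names (SimplifiedSelfAttention, TrimmedEncoderLayer) and the division-by-sequence-length stand-in for softmax are assumptions for illustration only; they are not the paper's released implementation or its exact choice of softmax replacement.

```python
import torch
import torch.nn as nn


class SimplifiedSelfAttention(nn.Module):
    """Multi-head self-attention with the softmax over attention scores
    replaced by a cheaper normalization (here, division by the sequence
    length -- an illustrative stand-in, not the paper's exact choice)."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the hidden dimension into heads: (batch, heads, tokens, head_dim).
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Softmax-free weighting: scale raw scores instead of exponentiating.
        weights = scores / t
        context = (weights @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(context)


class TrimmedEncoderLayer(nn.Module):
    """Encoder layer without the intermediate (feed-forward) sub-block and
    with a single layernorm instead of the usual two."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.attn = SimplifiedSelfAttention(hidden_size, num_heads)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Residual connection around attention, followed by one layernorm.
        return self.norm(x + self.attn(x))


if __name__ == "__main__":
    layer = TrimmedEncoderLayer()
    tokens = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
    print(layer(tokens).shape)         # torch.Size([2, 16, 768])
```

In the full model described by the abstract, only some of the encoder layers would have their intermediate sub-blocks and layernorms removed; this sketch shows a single trimmed layer to illustrate the structure of the change.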
