CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Paper title

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Authors

Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu

Abstract

The paper proposes and optimizes a partial recovery training system, CPR, for recommendation models. CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during training, reducing failure-related overheads. To the best of our knowledge, the paper is the first to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, and it identifies a trade-off between accuracy and performance. Motivated by the analysis, we present CPR, a partial recovery training system that can reduce the training time and maintain the desired level of model accuracy by (1) estimating the benefit of partial recovery, (2) selecting an appropriate checkpoint saving interval, and (3) prioritizing the saving of updates to more frequently accessed parameters. Two variants of CPR, CPR-MFU and CPR-SSU, reduce the checkpoint-related overhead from 8.2-8.5% to 0.53-0.68% compared to full recovery, on a configuration emulating the failure pattern and overhead of a production-scale cluster. While reducing overhead significantly, CPR achieves model quality on par with the more expensive full recovery scheme when training the state-of-the-art recommendation model on Criteo's Ads CTR dataset. Our preliminary results also suggest that CPR can speed up training on a real production-scale cluster without notably degrading the accuracy.
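
The two ideas in the abstract, saving the more frequently accessed parameters first and letting non-failed nodes keep their state after a failure, can be illustrated with a toy sketch. This is not the paper's implementation: the class `ToyPartialCheckpointer`, its methods, and the dictionary-based embedding table are hypothetical names chosen for illustration, and the selection rule (top-k rows by access count) is only an assumed stand-in for the prioritization the abstract describes.

```python
from collections import Counter


class ToyPartialCheckpointer:
    """Hypothetical toy: checkpoints only the most frequently accessed rows."""

    def __init__(self, table):
        self.table = table              # row_id -> list of floats (live parameters)
        self.access_counts = Counter()  # accesses seen since the last save
        self.saved = {}                 # row_id -> copy of the row at save time

    def record_access(self, row_ids):
        # Count every lookup of an embedding row.
        self.access_counts.update(row_ids)

    def save_partial(self, k):
        # Save only the k rows accessed most often since the last save.
        top_rows = [row for row, _ in self.access_counts.most_common(k)]
        for row in top_rows:
            self.saved[row] = list(self.table[row])
        self.access_counts.clear()
        return top_rows

    def partial_recover(self, failed_rows):
        # Only rows owned by the failed node are rolled back to their saved
        # values (if any). All other rows keep their current, newer values,
        # so the table is no longer a single consistent snapshot.
        for row in failed_rows:
            if row in self.saved:
                self.table[row] = list(self.saved[row])


# Tiny demonstration with three one-dimensional rows.
table = {0: [1.0], 1: [2.0], 2: [3.0]}
ckpt = ToyPartialCheckpointer(table)
ckpt.record_access([0, 0, 1, 0, 2, 1])   # row 0: 3 hits, row 1: 2, row 2: 1
saved_rows = ckpt.save_partial(k=2)      # rows 0 and 1 are saved

table[0] = [9.0]                         # training continues after the save
table[2] = [7.0]
ckpt.partial_recover(failed_rows=[0])    # the node holding row 0 fails
```

After the simulated failure, row 0 returns to its saved value while row 2 keeps its post-checkpoint update. That mixed state is the relaxed consistency the abstract refers to, and it is the source of the accuracy and performance trade-off the paper analyzes.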
