论文标题
带有通信压缩和Bernoulli聚集的分布式牛顿型方法
Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation
论文作者
论文摘要
尽管计算高昂和通信成本,牛顿型方法仍然是分布式培训的一种吸引人的选择,因为它们对不良条件的凸问题进行了稳健性。在这项工作中,我们研究了通信压缩和曲率信息的聚合机制,以降低这些成本,同时保留理论上优越的局部收敛保证。我们证明了Richtarik等人最近开发的三点压缩机(3PC)类别。 [2022]对于梯度交流也可以推广到Hessian通信。该结果开辟了各种各样的通信策略,例如收缩压缩}和懒惰聚集,可用于我们的处置,以压缩昂贵的曲率信息。此外,我们发现了几种新的3PC机制,例如自适应阈值和Bernoulli聚集,这些机制需要减少通信和偶尔的Hessian计算。此外,我们扩展并分析了双向通信压缩和部分设备参与设置的方法,以迎合联邦学习中应用的实际考虑。对于我们的所有方法,我们得出了与速度无关的局部线性和/或超线性收敛速率的快速条件。最后,通过对凸优化问题进行广泛的数值评估,我们说明我们设计的方案与使用二阶信息相比,与几个关键基线相比,我们的设计方案达到了最新的通信复杂性。
Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study ommunication compression and aggregation mechanisms for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of three point compressors (3PC) of Richtarik et al. [2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as contractive compression} and lazy aggregation, available to our disposal to compress prohibitively costly curvature information. Moreover, we discovered several new 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, which require reduced communication and occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast condition-number-independent local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve state-of-the-art communication complexity compared to several key baselines using second-order information.