Paper Title

Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems

Authors

Siyuan Zhuang, Zhuohan Li, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica

Abstract

Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models that are difficult to execute efficiently with traditional collective communication by up to 7.8x, 3.9x, and 3.3x, respectively.
