Paper Title

DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems

Authors

Newsha Ardalani, Saptadeep Pal, Puneet Gupta

Abstract

Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However, hardware utilization in large-scale AI systems is alarmingly low (5-20%). This low system utilization is the cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between engineers who design different layers and work in different industries. We propose CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. We also propose DeepFlow (built on top of CrossFlow using machine learning techniques) to automate design space exploration and co-optimization across different layers of the stack. We have validated CrossFlow's accuracy with distributed training on real commercial hardware, and we showcase several DeepFlow case studies demonstrating the pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments in all aspects of the computing stack.
