Paper Title
JAMPI: efficient matrix multiplication in Spark using Barrier Execution Mode
Paper Authors
Paper Abstract
The new barrier mode in Apache Spark allows embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow. In Spark, a task in a stage does not depend on any other task in the same stage, and hence can be scheduled independently. However, several algorithms require more sophisticated inter-task communication, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network IO), OpenJDK's new auto-vectorization, and Spark's barrier execution mode, we can add non-map/reduce-based algorithms, such as Cannon's distributed matrix multiplication, to Spark. We document an efficient distributed matrix multiplication using Cannon's algorithm, which improves significantly on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein yields a performance increase of up to 24 percent on a 10,000×10,000 square matrix with a significantly lower memory footprint. Applications of efficient matrix multiplication include, among others, accelerating the training and deployment of deep convolutional neural network-based workloads, and thus such efficient algorithms can play a ground-breaking role in faster, more efficient execution of even the most complicated machine learning tasks.
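The abstract's key mechanism is Spark's barrier execution mode, which schedules all tasks of a stage together and lets each task discover and synchronize with its peers, something ordinary map/reduce stages do not allow. The following Scala sketch is not the paper's JAMPI implementation; it only illustrates, assuming Spark 2.4+ and enough executor slots for all barrier tasks, where peer discovery and a global barrier would sit in a stage that could host Cannon's block-shift steps. The object name BarrierSketch and the toy payload are illustrative.

```scala
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Barrier execution mode: the whole stage is gang-scheduled, so every
    // task can see its peers and coordinate with them, unlike a normal
    // map/reduce stage where tasks are scheduled independently.
    val result = sc.parallelize(0 until 4, numSlices = 4)
      .barrier()
      .mapPartitions { iter =>
        val ctx = BarrierTaskContext.get()
        // Host:port addresses of all tasks in this barrier stage; a
        // Cannon-style implementation would use these to exchange
        // matrix blocks over asynchronous network IO.
        val peers = ctx.getTaskInfos().map(_.address)
        // Global synchronization point, analogous to MPI_Barrier.
        ctx.barrier()
        Iterator.single((ctx.partitionId(), peers.length))
      }
      .collect()

    result.foreach { case (pid, n) => println(s"task $pid sees $n peer tasks") }
    spark.stop()
  }
}
```

In a full Cannon's algorithm implementation, each barrier task would hold one block of each input matrix, and the loop between peer discovery and the final barrier would repeatedly multiply local blocks and shift them to neighboring tasks; the sketch above only shows the barrier-stage scaffolding that makes such inter-task communication possible in Spark.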