几种符合缓存算法的平衡分区

论文标题

几种符合缓存算法的平衡分区

Balanced Partitioning of Several Cache-Oblivious Algorithms

论文作者

Tang, Yuan, Gao, Weiguo

论文摘要

Frigo等。提出了一种理想的缓存模型和一种递归技术，用于以合规性的方式设计顺序高效算法。 Ballard等。指出，将技术扩展到任意体系结构是一个基本的开放问题。 Ballard等。关于如何在任意数量的处理器上精确有效地将Strassen的算法并行提出了另一个开放的问题。我们提出了一种新颖的方式，将合并缓存的算法划分，以在共享记忆环境中在一定范围内的任意数字（甚至是质量数字）上实现完美的强大缩放。我们的方法是意识到处理器，但合并了缓存（PACO）。我们展示了我们对几种重要的高速缓存算法的方法，包括LCS，1D，GAP，经典的矩形矩阵乘数在半度性上和Strassen的算法。我们讨论如何将方法扩展到分布式内存架构，甚至是异质计算系统。因此，我们的工作可能会提供有关将递归缓存技术扩展到任意体系结构的基本开放问题的新观点。我们为并行化的strassen提供了几乎精确的解决方案。我们的方法可能会提供一个新的观点，即将递归缓存技术扩展到任意体系结构。与最著名的算法相比，我们所有的算法都表现出更好的可扩展性或更好的总体平行缓存复杂性。初步实验证明了我们的理论预测是合理的，即PACO算法可以胜过最先进的处理器（PO）和处理器意识（PA）（PA）对应物。

Frigo et al. proposed an ideal cache model and a recursive technique to design sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that it is a fundamental open problem to extend the technique to an arbitrary architecture. Ballard et al. raised another open question on how to parallelize Strassen's algorithm exactly and efficiently on an arbitrary number of processors. We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number, even a prime number, of processors within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We demonstrate our approach on several important cache-oblivious algorithms, including LCS, 1D, GAP, classic rectangular matrix multiplication on a semiring, and Strassen's algorithm. We discuss how to extend our approach to a distributed-memory architecture, or even a heterogeneous computing system. Hence, our work may provide a new perspective on the fundamental open problem of extending the recursive cache-oblivious technique to an arbitrary architecture. We provide an almost exact solution to the open problem on parallelizing Strassen. Our approach may provide a new perspective on extending the recursive cache-oblivious technique to an arbitrary architecture. All our algorithms demonstrate better scalability or better overall parallel cache complexities than the best known algorithms. Preliminary experiments justify our theoretical prediction that the PACO algorithms can outperform significantly state-of-the-art Processor-Oblivious (PO) and Processor-Aware (PA) counterparts.

下载PDF全文

下载文献需遵守相关版权规定

论文标题