Paper Title
ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems
Paper Authors
Paper Abstract
Partitioning applications between near-data processing (NDP) cores and host CPU cores causes inter-segment data movement overhead: data generated by one segment (e.g., an instruction or a function) and used in consecutive segments must be moved between the two sides. Prior works take two approaches to this problem. The first class of works maps segments to NDP or host cores based on the properties of each segment, neglecting the inter-segment data movement overhead. The second class of works partitions applications based on the overall memory bandwidth saving of each segment, and does not offload a segment to its best-fitting core if doing so incurs high inter-segment data movement. We show that 1) ideally mapping each segment to its best-fitting core can provide substantial benefits, and 2) inter-segment data movement reduces this benefit significantly. To this end, we introduce ALP, a new programmer-transparent technique that leverages the performance benefits of NDP by alleviating the inter-segment data movement overhead between host and memory and enabling efficient partitioning of applications. ALP alleviates the inter-segment data movement overhead by proactively and accurately transferring the required data between segments. This is based on the key observation that the instructions that generate the inter-segment data stay the same across different executions of a program on different inputs. ALP uses a compiler pass to identify these instructions and uses specialized hardware to transfer data between the host and NDP cores at runtime. ALP efficiently maps application segments to either host or NDP cores, considering 1) the properties of each segment, 2) the inter-segment data movement overhead, and 3) whether this overhead can be alleviated in a timely manner. We evaluate ALP across a wide range of workloads and show average speedups of 54.3% and 45.4% compared to host-CPU-only and NDP-only execution, respectively.
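The trade-off the abstract describes, between running each segment on its best-fitting core and paying for inter-segment data movement, can be illustrated with a toy cost model. The sketch below is not ALP's compiler pass or hardware mechanism, which the abstract does not specify. It assumes a linear chain of segments, and the names `Segment`, `best_mapping`, and `hidden_fraction` are hypothetical, as are all the timing numbers. `hidden_fraction` stands in for how much of the transfer cost can be hidden by moving data proactively.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # All fields are hypothetical, illustrative costs in arbitrary time units.
    host_time: float      # execution time if the segment runs on a host core
    ndp_time: float       # execution time if the segment runs on an NDP core
    transfer_time: float  # cost of moving this segment's output to the other side


def best_mapping(segments, hidden_fraction=0.0):
    """Pick host or NDP for each segment of a linear chain.

    Minimizes execution time plus the transfer penalty paid whenever two
    consecutive segments run on different sides. hidden_fraction in [0, 1]
    is the (assumed) share of the transfer cost that is hidden.
    Returns (total_time, list_of_core_names).
    """
    if not segments:
        return 0.0, []
    cores = ("host", "ndp")
    first = segments[0]
    # cost[c]: best total time so far with the latest segment placed on core c
    cost = {"host": first.host_time, "ndp": first.ndp_time}
    back = []  # back[i][c]: best core for segment i given segment i+1 is on c
    for prev, seg in zip(segments, segments[1:]):
        exec_time = {"host": seg.host_time, "ndp": seg.ndp_time}
        penalty = prev.transfer_time * (1.0 - hidden_fraction)
        new_cost, choice = {}, {}
        for c in cores:
            # Staying on the same side is free; switching pays the penalty.
            options = {p: cost[p] + (0.0 if p == c else penalty) for p in cores}
            best_prev = min(options, key=options.get)
            new_cost[c] = options[best_prev] + exec_time[c]
            choice[c] = best_prev
        cost = new_cost
        back.append(choice)
    # Walk the recorded choices backwards to recover the full mapping.
    last = min(cost, key=cost.get)
    mapping = [last]
    for choice in reversed(back):
        mapping.append(choice[mapping[-1]])
    mapping.reverse()
    return cost[last], mapping


if __name__ == "__main__":
    chain = [Segment(1, 5, 10), Segment(5, 1, 10), Segment(1, 5, 10)]
    print(best_mapping(chain, hidden_fraction=0.0))
    print(best_mapping(chain, hidden_fraction=1.0))
```

With these made-up numbers and no hiding, the transfer penalty outweighs the benefit of offloading, so every segment stays on the host (total 7 units). With the transfer fully hidden, the middle segment moves to NDP and the total drops to 3 units. This mirrors the abstract's two claims: best-fit mapping helps, and inter-segment data movement erodes that benefit unless it is alleviated.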