Paper Title
Evaluating the Cost of Atomic Operations on Modern Architectures
Paper Authors
Paper Abstract
Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet the performance tradeoffs between these operations and characteristics of the underlying hardware, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper, we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for the latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth, even though they are characterized by different consensus numbers. Another insight is that the hardware implementation of atomics prevents any instruction-level parallelism, even if there are no dependencies between the issued operations. Finally, we discuss solutions to the performance issues discovered in the analyzed architectures. Our analysis enables simpler and more effective parallel programming and accelerates data processing on various architectures deployed in both off-the-shelf machines and large compute systems.
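For readers unfamiliar with the two atomics named in the abstract, the following is a minimal C++ sketch of what each one does. It is not code from the paper: the function names `faa` and `faa_via_cas` are hypothetical and chosen only for illustration. On x86, compilers typically lower `fetch_add` to a `LOCK XADD` instruction and `compare_exchange_weak` to `LOCK CMPXCHG`, which are the kinds of instructions such a study would measure.

```cpp
#include <atomic>
#include <cstdint>

// Fetch-and-Add (FAA): atomically adds `delta` to `target` and returns the
// value that was stored before the addition.
// `faa` is a hypothetical wrapper name used only in this sketch.
inline std::uint64_t faa(std::atomic<std::uint64_t>& target, std::uint64_t delta) {
    return target.fetch_add(delta, std::memory_order_seq_cst);
}

// The same update expressed with Compare-and-Swap (CAS): read the current
// value, then try to install `observed + delta`, retrying if another thread
// changed the value in between.
// `faa_via_cas` is likewise a hypothetical name.
inline std::uint64_t faa_via_cas(std::atomic<std::uint64_t>& target, std::uint64_t delta) {
    std::uint64_t observed = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value of `target`
    // into `observed`, so the next attempt uses fresh data.
    while (!target.compare_exchange_weak(observed, observed + delta,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
    }
    return observed;  // value held by `target` just before the successful swap
}
```

Both functions return the previous value and leave the counter incremented by `delta`. The CAS version needs a retry loop because a single CAS can fail under contention, whereas FAA always succeeds in one operation. CAS is nonetheless the stronger primitive in the consensus-number sense that the abstract refers to.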