Paper Title
GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture
Paper Authors
Paper Abstract
Graphics Processing Units (GPUs) have traditionally relied on the host CPU to initiate access to data storage. This approach is well-suited for GPU applications with known data access patterns, whose datasets can be partitioned and processed in a pipelined fashion on the GPU. However, emerging applications such as graph and data analytics, recommender systems, and graph neural networks require fine-grained, data-dependent access to storage. CPU initiation of storage access is unsuitable for these applications due to high CPU-GPU synchronization overheads, I/O traffic amplification, and long CPU processing latencies. GPU-initiated storage removes these overheads from the storage control path and thus can potentially support these applications at much higher speeds. However, no existing system architecture and software stack enables efficient GPU-initiated storage access. This work presents a novel system architecture, BaM, that fills this gap. BaM features a fine-grained software cache that coalesces data storage requests while minimizing I/O traffic amplification. This software cache communicates with the storage system via high-throughput queues that enable the massive number of concurrent threads in modern GPUs to issue I/O requests at a rate high enough to fully utilize the storage devices and the system interconnect. Experimental results show that BaM delivers 1.0x and 1.49x end-to-end speedups for the BFS and CC graph analytics benchmarks, respectively, while reducing hardware costs by up to 21.7x compared with accessing the graph data from host memory. Furthermore, BaM speeds up data-analytics workloads by 5.3x over CPU-initiated storage access on the same hardware.
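The two mechanisms the abstract highlights, a fine-grained software cache that coalesces requests and lock-free high-throughput submission queues, can be illustrated with a minimal CUDA sketch. Note this is a conceptual toy under assumed names (`IoQueue`, `IoRequest`, `enqueue_io`, `access_kernel`, and `QUEUE_CAP` are all hypothetical and are not BaM's actual API): each GPU thread claims a cache slot with an atomic compare-and-swap, and only the winning thread submits an I/O request by reserving a queue entry with a single `atomicAdd`, which is what lets thousands of concurrent threads enqueue without serializing on a lock.

```cuda
// Hedged sketch: a device-side ring buffer standing in for BaM-style
// high-throughput I/O queues, plus a toy cache directory that coalesces
// duplicate block requests. All names are illustrative, not BaM's real API.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int QUEUE_CAP = 4096;           // entries in the submission queue

struct IoRequest {
    unsigned long long lba;               // logical block address to read
    unsigned int       dst_slot;          // cache slot that receives the data
};

struct IoQueue {
    IoRequest          entries[QUEUE_CAP];
    unsigned long long tail;              // next free entry (monotonic counter)
};

// One atomicAdd reserves an entry, so enqueue throughput scales with the
// GPU's thread count instead of serializing on a lock. A real design must
// also avoid overrunning un-consumed entries; omitted here for brevity.
__device__ void enqueue_io(IoQueue* q, unsigned long long lba, unsigned int slot) {
    unsigned long long t = atomicAdd(&q->tail, 1ULL);
    q->entries[t % QUEUE_CAP] = IoRequest{lba, slot};
}

// Toy cache directory: one tag per slot, 0 meaning "empty".
__global__ void access_kernel(IoQueue* q, unsigned long long* tags, int nslots) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Fake data-dependent access: four threads touch each storage block.
    unsigned long long lba = tid / 4 + 1;          // +1 so tag 0 can mean "empty"
    unsigned int slot = lba % nslots;
    // The first thread to claim the slot issues the read; the other three
    // threads' accesses are coalesced into that single request.
    if (atomicCAS(&tags[slot], 0ULL, lba) == 0ULL)
        enqueue_io(q, lba, slot);
    // On a hit (or a miss another thread is servicing), real BaM threads
    // would poll the slot's state until the cached data becomes valid.
}

int main() {
    IoQueue* q;
    unsigned long long* tags;
    int nslots = 1024;
    cudaMallocManaged(&q, sizeof(IoQueue));
    cudaMallocManaged(&tags, nslots * sizeof(unsigned long long));
    cudaMemset(tags, 0, nslots * sizeof(unsigned long long));
    q->tail = 0;
    access_kernel<<<16, 256>>>(q, tags, nslots);   // 4096 threads
    cudaDeviceSynchronize();
    printf("4096 thread accesses -> %llu I/O requests\n", q->tail);
    cudaFree(q);
    cudaFree(tags);
    return 0;
}
```

In this sketch the 4096 thread accesses collapse into 1024 queued reads, mirroring the abstract's point that coalescing in the software cache minimizes I/O traffic amplification while the atomic-counter queue keeps submission rates high enough to saturate the storage devices.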