The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Paper title

The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Authors

Ahmet Inci, Evgeny Bolotin, Yaosheng Fu, Gal Dalal, Shie Mannor, David Nellans, Diana Marculescu

Abstract

With deep reinforcement learning (RL) methods achieving results that exceed human capabilities in games, robotics, and simulated environments, continued scaling of RL training is crucial to its deployment in solving complex real-world problems. However, improving the performance scalability and power efficiency of RL training through understanding the architectural implications of CPU-GPU systems remains an open problem. In this work we investigate and improve the performance and power efficiency of distributed RL training on CPU-GPU systems by approaching the problem not solely from the GPU microarchitecture perspective but following a holistic system-level analysis approach. We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework and empirically identify the bottlenecks caused by GPU microarchitectural, algorithmic, and system-level design choices. We show that the GPU microarchitecture itself is well-balanced for state-of-the-art RL frameworks, but further investigation reveals that the number of actors running the environment interactions and the amount of hardware resources available to them are the primary performance and power efficiency limiters. To this end, we introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources when designing scalable and efficient CPU-GPU systems for RL training.
