Optimizing Data Collection in Deep Reinforcement Learning

Paper Title

Optimizing Data Collection in Deep Reinforcement Learning

Authors

James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara, Gennady Pekhimenko

Abstract

Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU implementations of simulators induce high overhead when switching back and forth between GPU computations. We explore two optimizations that increase RL data collection efficiency by increasing GPU utilization: (1) GPU vectorization: parallelizing simulation on the GPU for increased hardware parallelism, and (2) simulator kernel fusion: fusing multiple simulation steps to run in a single GPU kernel launch to reduce global memory bandwidth requirements. We find that GPU vectorization can achieve up to $1024\times$ speedup over commonly used CPU simulators. We profile the performance of different implementations and show that for a simple simulator, ML compiler implementations (XLA) of GPU vectorization outperform a DNN framework (PyTorch) by $13.4\times$ by reducing CPU overhead from repeated Python to DL backend API calls. We show that simulator kernel fusion speedups with a simple simulator are $11.3\times$ and increase by up to $1024\times$ as simulator complexity increases in terms of memory bandwidth requirements. We show that the speedups from simulator kernel fusion are orthogonal and combinable with GPU vectorization, leading to a multiplicative speedup.
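
The two optimizations named in the abstract can be illustrated with a minimal JAX sketch. This is not the authors' implementation: the point-mass simulator, the `step` and `rollout` functions, and the constants `DT`, `num_envs`, and `num_steps` are all hypothetical, chosen only to keep the example small. `jax.vmap` stands in for GPU vectorization (many environments advanced in one batched call), and `jax.lax.scan` under `jax.jit` stands in for fusing several simulation steps into one compiled call. Whether XLA emits a single GPU kernel for the whole loop depends on the compiler and backend, so this only approximates the kernel fusion the abstract describes.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy simulator (an assumption, not from the paper):
# a 1-D point mass whose velocity is changed by the action.
DT = 0.05

def step(state, action):
    """Advance ONE environment by one step; state is (position, velocity)."""
    pos, vel = state
    vel = vel + DT * action
    pos = pos + DT * vel
    reward = -jnp.abs(pos)  # reward is higher the closer the mass is to the origin
    return (pos, vel), reward

# (1) Vectorization: vmap turns the single-environment step into a batched
# step that advances every environment in one array operation.
batched_step = jax.vmap(step)

# (2) Multi-step fusion (approximation): scan under jit compiles the whole
# rollout into one XLA computation, so Python dispatches once per rollout
# instead of once per simulation step.
@jax.jit
def rollout(state, actions):
    # actions has shape [num_steps, num_envs]
    return jax.lax.scan(batched_step, state, actions)

num_envs, num_steps = 1024, 16
state0 = (jnp.zeros(num_envs), jnp.zeros(num_envs))
actions = jnp.ones((num_steps, num_envs))

final_state, rewards = rollout(state0, actions)
print(rewards.shape)  # one reward per step per environment
```

Calling `batched_step` from a Python loop instead of `rollout` would give the same results while paying a Python-to-backend dispatch on every step, which is the kind of CPU overhead the abstract attributes to the PyTorch implementation.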
