Paper Title

Local Search for Policy Iteration in Continuous Control

Paper Authors

Jost Tobias Springenberg, Nicolas Heess, Daniel Mankowitz, Josh Merel, Arunkumar Byravan, Abbas Abdolmaleki, Jackie Kay, Jonas Degrave, Julian Schrittwieser, Yuval Tassa, Jonas Buchli, Dan Belov, Martin Riedmiller

Paper Abstract

We present an algorithm for local, regularized, policy improvement in reinforcement learning (RL) that allows us to formulate model-based and model-free variants in a single framework. Our algorithm can be interpreted as a natural extension of work on KL-regularized RL and introduces a form of tree search for continuous action spaces. We demonstrate that additional computation spent on model-based policy improvement during learning can improve data efficiency, and confirm that model-based policy improvement during action selection can also be beneficial. Quantitatively, our algorithm improves data efficiency on several continuous control benchmarks (when a model is learned in parallel), and it provides significant improvements in wall-clock time in high-dimensional domains (when a ground truth model is available). The unified framework also helps us to better understand the space of model-based and model-free algorithms. In particular, we demonstrate that some benefits attributed to model-based RL can be obtained without a model, simply by utilizing more computation.
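
The abstract describes a KL-regularized policy improvement step combined with a sampling-based search over continuous actions. As a rough illustration of the kind of update such methods build on (not the paper's actual algorithm), the sketch below reweights actions sampled from a prior policy by their exponentiated Q-values under a temperature; the function name and the example numbers are hypothetical.

```python
import numpy as np

def kl_regularized_improvement(q_values, prior_probs, eta=1.0):
    """Illustrative KL-regularized policy improvement over sampled actions.

    Computes pi_new(a) proportional to pi_prior(a) * exp(Q(s, a) / eta),
    where eta controls how far the improved policy may move from the prior.
    This is a sketch for intuition, not the paper's algorithm.
    """
    logits = np.log(prior_probs) + q_values / eta
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()  # normalized improved policy over samples

# Example: 5 candidate actions sampled uniformly from the prior for one state.
q = np.array([1.0, 0.5, 2.0, -1.0, 0.0])
prior = np.full(5, 0.2)
print(kl_regularized_improvement(q, prior, eta=0.5))
```

A smaller eta concentrates the improved policy on the highest-value sampled actions, while a larger eta keeps it close to the prior; this trade-off is the regularization referred to in the abstract.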
