Paper Title
Model-Augmented Actor-Critic: Backpropagating through Paths
Paper Authors
Paper Abstract
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator to augment the data for policy optimization or value function learning. In this paper, we show how to make more effective use of the model by exploiting its differentiability. We construct a policy optimization algorithm that uses the pathwise derivative of the learned model and policy across future timesteps. Instabilities of learning across many timesteps are prevented by using a terminal value function, learning the policy in an actor-critic fashion. Furthermore, we present a derivation on the monotonic improvement of our objective in terms of the gradient error in the model and value function. We show that our approach (i) is consistently more sample efficient than existing state-of-the-art model-based algorithms, (ii) matches the asymptotic performance of model-free algorithms, and (iii) scales to long horizons, a regime where typically past model-based approaches have struggled.
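To make the pathwise-derivative idea concrete, below is a minimal JAX sketch (not the authors' implementation): a policy is unrolled through a differentiable learned dynamics model for H steps, the discounted rewards are summed, a discounted terminal value closes off the rollout, and the gradient of that scalar with respect to the policy parameters is taken through the entire path. The names policy_fn, model_fn, reward_fn, value_fn, the toy linear parameterizations, and all dimensions and hyperparameters are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of a pathwise-derivative, model-based actor-critic objective.
# NOT the authors' code: all function names, parameterizations, and numbers
# are placeholder assumptions for illustration.
import jax
import jax.numpy as jnp


def policy_fn(theta, s):
    # Deterministic, differentiable policy (toy linear-tanh parameterization).
    return jnp.tanh(theta @ s)


def model_fn(phi, s, a):
    # Learned, differentiable dynamics model (toy linear parameterization).
    return phi @ jnp.concatenate([s, a])


def reward_fn(s, a):
    # Differentiable reward (assumed known here for simplicity).
    return -jnp.sum(s ** 2) - 0.1 * jnp.sum(a ** 2)


def value_fn(psi, s):
    # Terminal value function (critic) that truncates the imagined rollout.
    return psi @ s


def pathwise_objective(theta, phi, psi, s0, horizon=10, gamma=0.99):
    # H-step discounted return under the learned model plus a discounted
    # terminal value; gradients flow through every imagined timestep.
    s, ret, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy_fn(theta, s)
        ret = ret + disc * reward_fn(s, a)
        s = model_fn(phi, s, a)
        disc = disc * gamma
    return ret + disc * value_fn(psi, s)


# The policy gradient is the derivative of the whole imagined path with
# respect to the policy parameters, not a likelihood-ratio estimate.
policy_grad = jax.grad(pathwise_objective, argnums=0)

# Toy usage with made-up dimensions.
s_dim, a_dim = 3, 2
key = jax.random.PRNGKey(0)
theta = 0.1 * jax.random.normal(key, (a_dim, s_dim))
phi = 0.1 * jax.random.normal(key, (s_dim, s_dim + a_dim))
psi = jnp.zeros(s_dim)
grad_theta = policy_grad(theta, phi, psi, jnp.ones(s_dim))
```

With a stochastic policy, the same construction would presumably use reparameterized (pathwise) sampling so that gradients still flow through the actions; the deterministic policy above keeps the sketch short.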