Paper Title
Accelerating Self-Imitation Learning from Demonstrations via Policy Constraints and Q-Ensemble
Paper Authors
Paper Abstract
Deep reinforcement learning (DRL) provides a new way to generate robot control policies. However, training a control policy requires lengthy exploration, resulting in the low sample efficiency of reinforcement learning (RL) in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline-to-online reinforcement learning requires a large amount of offline data to initialize the policy, and distribution shift can easily cause performance degradation during online fine-tuning. To address these problems, we propose a learning-from-demonstrations method named A-SILfD, which treats expert demonstrations as the agent's successful experiences and uses these experiences to constrain policy improvement. Furthermore, we prevent performance degradation caused by large estimation errors in the Q-function by using an ensemble of Q-functions. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of expert demonstrations of varying quality. In four MuJoCo continuous control tasks, A-SILfD significantly outperforms baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations during training.
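The abstract names two mechanisms: storing expert demonstrations as the agent's own successful experiences alongside online transitions, and using ensemble Q-functions to curb large Q-value estimation errors. The sketch below is a minimal illustration of those two ideas in a PyTorch setting; it is not the authors' implementation, and the class names (`QNet`, `EnsembleCritic`), the min-over-ensemble target, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the A-SILfD implementation):
# (1) demonstrations seeded into the same replay buffer as online data,
# (2) an ensemble of Q-networks with a conservative (min) bootstrap target.
import random
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Simple state-action value network (illustrative architecture)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


class EnsembleCritic:
    """N independent Q-functions; the target uses the ensemble minimum,
    which is one common way to damp overestimation errors."""
    def __init__(self, obs_dim, act_dim, n_critics=5, gamma=0.99, lr=3e-4):
        self.critics = [QNet(obs_dim, act_dim) for _ in range(n_critics)]
        self.optims = [torch.optim.Adam(q.parameters(), lr=lr) for q in self.critics]
        self.gamma = gamma

    def target(self, rew, next_obs, next_act, done):
        # rew and done are [batch, 1] tensors.
        with torch.no_grad():
            q_next = torch.stack([q(next_obs, next_act) for q in self.critics])
            q_min = q_next.min(dim=0).values  # element-wise min over members
            return rew + self.gamma * (1.0 - done) * q_min

    def update(self, obs, act, y):
        # Each ensemble member regresses toward the shared conservative target.
        for q, opt in zip(self.critics, self.optims):
            loss = nn.functional.mse_loss(q(obs, act), y)
            opt.zero_grad()
            loss.backward()
            opt.step()


# Expert demonstrations are added to the same buffer as online transitions,
# so the learner treats them as prior successful experiences of its own.
replay = []  # list of (obs, act, rew, next_obs, done) tuples


def add_demonstrations(demos):
    replay.extend(demos)


def sample_batch(batch_size=256):
    return random.sample(replay, min(batch_size, len(replay)))
```

The min over the ensemble is only one possible conservative aggregate; a mean-minus-standard-deviation target is another common choice and may be closer to what a given Q-ensemble method actually uses.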