Bypassing the Simulation-to-reality Gap: Online Reinforcement Learning using a Supervisor

Paper Title

Bypassing the Simulation-to-reality Gap: Online Reinforcement Learning using a Supervisor

Paper Authors

Evans, Benjamin David; Betz, Johannes; Zheng, Hongrui; Engelbrecht, Herman A.; Mangharam, Rahul; Jordaan, Hendrik W.

Abstract

Deep reinforcement learning (DRL) is a promising method to learn control policies for robots only from demonstration and experience. To cover the whole dynamic behaviour of the robot, DRL training is an active exploration process typically performed in simulation environments. Although this simulation training is cheap and fast, applying DRL algorithms to real-world settings is difficult. If agents are trained until they perform safely in simulation, transferring them to physical systems is difficult due to the sim-to-real gap caused by the difference between the simulation dynamics and the physical robot. In this paper, we present a method of online training a DRL agent to drive autonomously on a physical vehicle by using a model-based safety supervisor. Our solution uses a supervisory system to check if the action selected by the agent is safe or unsafe and ensure that a safe action is always implemented on the vehicle. With this, we can bypass the sim-to-real problem while training the DRL algorithm safely, quickly, and efficiently. We compare our method with conventional learning in simulation and on a physical vehicle. We provide a variety of real-world experiments where we train online a small-scale vehicle to drive autonomously with no prior simulation training. The evaluation results show that our method trains agents with improved sample efficiency while never crashing, and the trained agents demonstrate better driving performance than those trained in simulation.
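
The supervisory idea in the abstract (check whether the agent's chosen action is safe, and always apply a safe action to the vehicle) can be illustrated with a minimal sketch. This is not the authors' implementation. The one-dimensional lateral-position model, the `Supervisor` class, its parameters, and the uniformly random stand-in for an untrained agent are all hypothetical, chosen only to show the control flow under the assumption that the supervisor's model matches the vehicle exactly.

```python
import random
from dataclasses import dataclass


@dataclass
class Supervisor:
    """Toy model-based safety supervisor (all values are hypothetical)."""
    half_width: float = 1.0   # track half-width around the centre line
    dt: float = 0.1           # time step of the toy model
    max_speed: float = 2.0    # lateral speed used by the fallback action

    def predict(self, y: float, action: float) -> float:
        # Toy dynamics: the action is a lateral velocity applied for one step.
        return y + action * self.dt

    def is_safe(self, y: float, action: float) -> bool:
        # An action is safe if the predicted position stays on the track.
        return abs(self.predict(y, action)) <= self.half_width

    def safe_action(self, y: float) -> float:
        # Fallback: steer back towards the centre line.
        return -self.max_speed if y > 0 else self.max_speed

    def supervise(self, y: float, action: float) -> tuple[float, bool]:
        # Return the action to apply and whether the supervisor intervened.
        if self.is_safe(y, action):
            return action, False
        return self.safe_action(y), True


def run_episode(steps: int = 200, seed: int = 0) -> tuple[int, float]:
    """Run a random 'agent' under supervision.

    Returns the number of interventions and the largest |y| reached.
    """
    rng = random.Random(seed)
    sup = Supervisor()
    y, interventions, max_abs_y = 0.0, 0, 0.0
    for _ in range(steps):
        proposed = rng.uniform(-15.0, 15.0)   # stand-in for an untrained agent
        action, intervened = sup.supervise(y, proposed)
        y = sup.predict(y, action)            # assumes the model is exact
        interventions += int(intervened)
        max_abs_y = max(max_abs_y, abs(y))
    return interventions, max_abs_y


if __name__ == "__main__":
    n, peak = run_episode()
    print(f"interventions: {n}, max |y|: {peak:.2f}")
```

In this toy setting the position never leaves the track, however poor the proposed actions are, because every unsafe proposal is replaced before it is applied. On a real vehicle that guarantee would depend on how well the supervisor's model matches the vehicle's actual dynamics.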
