Paper Title

Conservative Safety Critics for Exploration

Authors

Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, Animesh Garg

Abstract

Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial and error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are likely to be satisfied with high probability during training, derive provable convergence guarantees for our approach, which is no worse asymptotically than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at this url https://sites.google.com/view/conservative-safety-critics/home
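
The abstract describes gating exploratory actions with a learned conservative estimate of failure probability. Below is a minimal, hypothetical Python sketch of that idea only, not the paper's implementation: the names `policy`, `safety_critic`, `sample_action`, `eps`, and `max_tries` are assumptions for illustration, and the critic is assumed to return a conservative estimate of the probability of catastrophic failure for a state-action pair.

```python
import numpy as np

def sample_action(policy, state, rng):
    """Draw a candidate action from the current (partially trained) policy.
    Here `policy` is assumed to return a Gaussian (mean, std) for the state."""
    mean, std = policy(state)
    return rng.normal(mean, std)

def safe_action(policy, safety_critic, state, eps=0.1, max_tries=100, rng=None):
    """Rejection-sample candidate actions, accepting the first one whose
    conservative failure estimate falls below the threshold `eps`;
    otherwise fall back to the least-risky candidate seen."""
    rng = rng or np.random.default_rng()
    best_action, best_risk = None, float("inf")
    for _ in range(max_tries):
        action = sample_action(policy, state, rng)
        risk = safety_critic(state, action)  # conservative estimate of P(failure)
        if risk <= eps:
            return action
        if risk < best_risk:
            best_action, best_risk = action, risk
    return best_action
```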
