Paper Title
Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning
Paper Authors
Paper Abstract
Learning a risk-aware policy is essential yet challenging in unstructured robotic tasks. Safe reinforcement learning methods open up new possibilities for tackling this problem. However, their conservative policy updates make it intractable to achieve sufficient exploration and desirable performance in complex, sample-expensive environments. In this paper, we propose a dual-agent safe reinforcement learning strategy consisting of a baseline agent and a safe agent. Such a decoupled framework enables high flexibility, data efficiency, and risk awareness for RL-based control. Concretely, the baseline agent is responsible for maximizing rewards under standard RL settings, so it is compatible with off-the-shelf training techniques for unconstrained optimization, exploration, and exploitation. The safe agent, in turn, mimics the baseline agent for policy improvement and learns to fulfill safety constraints via off-policy RL tuning. In contrast to training from scratch, safe policy correction requires significantly fewer interactions to obtain a near-optimal policy. The dual policies can be optimized synchronously via a shared replay buffer, or by leveraging a pre-trained model or a non-learning-based controller as a fixed baseline agent. Experimental results show that our approach can learn feasible skills without prior knowledge as well as derive risk-averse counterparts from pre-trained unsafe policies. The proposed method outperforms state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks with respect to both safety constraint satisfaction and sample efficiency.
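As a rough illustration of the dual-agent idea described in the abstract, the sketch below pairs a reward-maximizing baseline actor with a safe actor that imitates the baseline while penalizing expected safety cost through a Lagrange multiplier, with both policies updated from a shared replay buffer. This is a minimal sketch under assumed names (DualAgent, ReplayBuffer, bc_weight, cost_limit) and deliberately simplified one-step critic targets; it is not the authors' implementation.

```python
# Illustrative sketch only: names, network sizes, and update rules are
# assumptions made for exposition, not the paper's exact algorithm.
import random
import collections
import torch
import torch.nn as nn

Transition = collections.namedtuple(
    "Transition", "state action reward cost next_state done")


class ReplayBuffer:
    """Shared replay buffer used by both the baseline and the safe agent."""

    def __init__(self, capacity=100_000):
        self.data = collections.deque(maxlen=capacity)

    def push(self, *args):
        self.data.append(Transition(*args))

    def sample(self, batch_size):
        batch = random.sample(self.data, batch_size)
        return Transition(*map(torch.stack, zip(*batch)))


def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class DualAgent:
    """Hypothetical trainer: a reward-only baseline policy plus a safe policy."""

    def __init__(self, obs_dim, act_dim, cost_limit=0.1, bc_weight=1.0):
        self.baseline_pi = mlp(obs_dim, act_dim)   # reward maximizer
        self.safe_pi = mlp(obs_dim, act_dim)       # risk-aware corrector
        self.q_reward = mlp(obs_dim + act_dim, 1)  # critic for task reward
        self.q_cost = mlp(obs_dim + act_dim, 1)    # critic for safety cost
        self.log_lambda = torch.zeros(1, requires_grad=True)  # Lagrange multiplier
        self.cost_limit = cost_limit
        self.bc_weight = bc_weight
        self.critic_opt = torch.optim.Adam(
            list(self.q_reward.parameters()) + list(self.q_cost.parameters()), lr=3e-4)
        self.actor_opt = torch.optim.Adam(
            list(self.baseline_pi.parameters()) + list(self.safe_pi.parameters())
            + [self.log_lambda], lr=3e-4)

    def update(self, batch):
        sa = torch.cat([batch.state, batch.action], dim=-1)
        # Critic targets are simplified to one-step rewards/costs for brevity.
        critic_loss = ((self.q_reward(sa).squeeze(-1) - batch.reward) ** 2).mean() \
            + ((self.q_cost(sa).squeeze(-1) - batch.cost) ** 2).mean()
        self.critic_opt.zero_grad()
        critic_loss.backward()
        self.critic_opt.step()

        # Baseline agent: unconstrained off-policy reward maximization.
        a_base = self.baseline_pi(batch.state)
        base_loss = -self.q_reward(torch.cat([batch.state, a_base], -1)).mean()

        # Safe agent: imitate the baseline, penalize expected cost above the limit.
        a_safe = self.safe_pi(batch.state)
        bc_loss = ((a_safe - a_base.detach()) ** 2).mean()
        exp_cost = self.q_cost(torch.cat([batch.state, a_safe], -1)).mean()
        lam = self.log_lambda.exp()
        safe_loss = self.bc_weight * bc_loss + lam.detach() * exp_cost
        lambda_loss = -lam * (exp_cost.detach() - self.cost_limit)

        actor_loss = base_loss + safe_loss + lambda_loss
        self.actor_opt.zero_grad()
        actor_loss.backward()
        self.actor_opt.step()
```

In this sketch the baseline actor would be frozen (its parameters excluded from the optimizer) to recover the fixed-baseline variant mentioned in the abstract, where a pre-trained model or a non-learning-based controller supplies the reference actions.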