Paper Title
Safety Aware Reinforcement Learning (SARL)
Paper Authors
Paper Abstract
As reinforcement learning agents become increasingly integrated into complex, real-world environments, designing for safety becomes a critical consideration. We specifically focus on researching scenarios where agents can cause undesired side effects while executing a policy on a primary task. Since one can define multiple tasks for a given environment dynamics, there are two important challenges. First, we need to abstract a concept of safety that applies broadly to that environment, independent of the specific task being executed. Second, we need a mechanism for the abstracted notion of safety to modulate the actions of agents executing different policies to minimize their side effects. In this work, we propose Safety Aware Reinforcement Learning (SARL) - a framework where a virtual safe agent modulates the actions of a main reward-based agent to minimize side effects. The safe agent learns a task-independent notion of safety for a given environment. The main agent is then trained with a regularization loss given by the distance between the native action probabilities of the two agents. Since the safe agent effectively abstracts a task-independent notion of safety via its action probabilities, it can be ported to modulate multiple policies solving different tasks within the given environment without further training. We contrast this with solutions that rely on task-specific regularization metrics and test our framework on the SafeLife Suite, based on Conway's Game of Life, which comprises a number of complex tasks in dynamic environments. We show that our solution is able to match the performance of solutions that rely on task-specific side-effect penalties on both the primary and safety objectives, while additionally providing the benefits of generalizability and portability.
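The abstract's core training signal - a task loss regularized by the distance between the main agent's and the safe agent's action distributions - can be sketched as follows. This is a minimal illustration, not the paper's implementation: the choice of KL divergence as the distance and the weight `beta` are assumptions for demonstration, as the abstract does not specify the exact distance metric.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions.

    A small epsilon keeps the logs finite for near-zero probabilities.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def sarl_regularized_loss(task_loss, pi_main, pi_safe, beta=0.1):
    """Hypothetical form of the SARL objective: the main agent's task
    loss plus a penalty for diverging from the safe agent's policy.

    pi_main, pi_safe: action-probability vectors over the same action set.
    beta: regularization weight (an assumed hyperparameter).
    """
    return task_loss + beta * kl_divergence(pi_main, pi_safe)

# When the main agent already matches the safe agent, the penalty vanishes;
# the further its action distribution drifts, the larger the total loss.
aligned = sarl_regularized_loss(1.0, [0.5, 0.5], [0.5, 0.5])
drifted = sarl_regularized_loss(1.0, [0.9, 0.1], [0.1, 0.9])
```

Because the safe agent enters the loss only through its action probabilities, the same frozen safe agent can regularize any number of main agents trained on different tasks in the environment, which is the portability property the abstract emphasizes.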