Title
Automata Learning meets Shielding
Authors
Abstract
Safety remains one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of how to avoid safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative procedure. Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair of the learned MDP, the shield computes the exact probability that executing the action in the current state leads to a violation of the specification within the next $k$ steps. Once constructed, the shield is used at runtime and blocks any action of the agent that induces too large a risk. The shielded agent continues to explore the environment and collects new data about it. Iteratively, we use the collected data to learn new MDPs with higher accuracy, which in turn yield shields that are able to prevent more safety violations. We implemented our approach and present a detailed case study of a Q-learning agent exploring slippery Gridworlds. In our experiments, we show that as the agent explores more and more of the environment during training, the improved learned models lead to shields that are able to prevent many safety violations.
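The shield computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the learned MDP is given as a plain dictionary, that the safety specification is simply "never enter an unsafe state", and that the risk of an action is the minimal probability of a violation within $k$ steps when the agent behaves as safely as possible afterwards. The names `violation_probabilities`, `allowed_actions`, the four-state example MDP, and the threshold value are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding of a (learned) MDP:
# mdp[state][action] = list of (probability, next_state) pairs.
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]


def violation_probabilities(
    mdp: MDP, unsafe: Set[str], k: int
) -> Dict[Tuple[str, str], float]:
    """For every state-action pair, compute the probability of reaching an
    unsafe state within k >= 1 steps, assuming (our assumption) that the agent
    picks the least risky action in all later steps."""
    # state_risk[s]: minimal violation probability within the steps processed so far.
    state_risk = {s: (1.0 if s in unsafe else 0.0) for s in mdp}
    action_risk: Dict[Tuple[str, str], float] = {}
    for _ in range(k):
        # One step of bounded-horizon value iteration.
        action_risk = {
            (s, a): sum(p * state_risk[t] for p, t in successors)
            for s, actions in mdp.items()
            for a, successors in actions.items()
        }
        state_risk = {
            s: 1.0
            if s in unsafe
            else min((action_risk[(s, a)] for a in mdp[s]), default=0.0)
            for s in mdp
        }
    return action_risk


def allowed_actions(
    action_risk: Dict[Tuple[str, str], float],
    state: str,
    actions: List[str],
    threshold: float,
) -> List[str]:
    """Return the actions the shield lets through in the given state."""
    safe = [a for a in actions if action_risk[(state, a)] <= threshold]
    # Assumed fallback: if every action is too risky, allow the least risky one
    # so the agent is never left without a choice.
    return safe or [min(actions, key=lambda a: action_risk[(state, a)])]


# Hypothetical slippery example: "direct" often slips into the hole,
# "detour" is slower but much safer.
mdp: MDP = {
    "start": {
        "direct": [(0.5, "goal"), (0.5, "hole")],
        "detour": [(0.9, "edge"), (0.1, "start")],
    },
    "edge": {"forward": [(0.9, "goal"), (0.1, "hole")]},
    "goal": {},
    "hole": {},
}

risk = violation_probabilities(mdp, unsafe={"hole"}, k=2)
print(allowed_actions(risk, "start", list(mdp["start"]), threshold=0.2))
```

In the iterative scheme from the abstract, a table like `risk` would be recomputed each time a new MDP is learned from the traces collected so far, and the Q-learning agent would only choose among the actions returned by `allowed_actions`.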