Paper Title

Scaffolding Reflection in Reinforcement Learning Framework for Confinement Escape Problem

Paper Authors

Nishant Mohanty, Suresh Sundaram

Paper Abstract

In this paper, a novel Scaffolding Reflection in Reinforcement Learning (SR2L) framework is proposed for solving the confinement escape problem (CEP). In CEP, the evader's objective is to escape a confinement region patrolled by multiple pursuers, while the pursuers aim to reach and capture the evader. The inverse problem, in which the pursuers attempt to capture the evader, has been extensively studied in the literature; however, the problem of the evader escaping from the region remains open. SR2L employs an actor-critic framework to enable the evader to escape the confinement region. A time-varying state representation and reward function are developed for proper convergence. The formulation uses sensor information about the observable environment and prior knowledge of the confinement boundary. The conventional Independent Actor-Critic (IAC) method fails to converge due to the sparseness of the reward, an effect that becomes evident when operating in such a large, dynamic environment. In SR2L, along with the developed reward function, we use the scaffolding reflection method to significantly improve convergence while increasing efficiency. A motion planner serves as a scaffold for the actor-critic network, which observes, compares, and learns from the planner's action-reward pairs. This enables the evader to achieve the required objective using fewer resources and less time. Convergence studies show that SR2L learns faster and converges to higher rewards than IAC. Extensive Monte Carlo simulations show that SR2L consistently outperforms both conventional IAC and the motion planner itself as baselines.
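
The abstract does not include implementation details, so the following is only a minimal sketch of the scaffolding idea it describes: a motion planner occasionally supplies the action, and the actor-critic learns from the resulting action-reward pair alongside its own rollouts. The gym-style environment interface, the discrete action set, the network sizes, the `scaffold_prob` schedule, and the `planner_action` callable standing in for the motion planner are all assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """A small actor-critic pair; layer sizes chosen for illustration."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1))

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.actor(s)), self.critic(s)


def update(net, opt, s, a, r, s_next, gamma=0.99):
    """One TD(0) actor-critic update on a single (s, a, r, s') transition."""
    dist, v = net(s)
    with torch.no_grad():
        _, v_next = net(s_next)
        td_target = r + gamma * v_next
    advantage = td_target - v
    loss = (-dist.log_prob(a) * advantage.detach() + advantage.pow(2)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()


def train_episode(env, net, opt, planner_action, scaffold_prob=0.5):
    """Roll out one episode. With probability `scaffold_prob` the motion
    planner (the scaffold) picks the action; otherwise the policy does.
    The evader learns from the observed action-reward pair in both cases."""
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    done = False
    while not done:
        if torch.rand(1).item() < scaffold_prob:
            a = torch.tensor(planner_action(s))   # scaffold's suggestion (assumed int index)
        else:
            dist, _ = net(s)
            a = dist.sample()                     # evader's own policy
        s_next, r, done, _ = env.step(int(a))     # assumed gym-style 4-tuple step
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        update(net, opt, s, a, r, s_next)
        s = s_next
```

In practice one would construct `net = ActorCritic(state_dim, n_actions)` and `opt = torch.optim.Adam(net.parameters(), lr=1e-3)` before calling `train_episode`; annealing `scaffold_prob` toward zero would shift control from the planner to the learned policy as training progresses.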
