Paper Title
Potential Field Guided Actor-Critic Reinforcement Learning
Paper Authors
Paper Abstract
In this paper, we consider the problem of actor-critic reinforcement learning. Firstly, we extend the actor-critic architecture to an actor-critic-N architecture by introducing additional critics beyond the reward-based one. Secondly, we combine the reward-based critic with a potential-field-based critic to formulate the proposed potential field guided actor-critic reinforcement learning approach (actor-critic-2). This can be seen as a combination of model-based and model-free gradients in policy improvement. A state with a large potential field usually carries strong prior information, such as pointing toward a distant target or steering around the side of an obstacle to avoid collision. In this situation, we should trust the potential-field-based critic more in policy evaluation to accelerate policy improvement, and the action policy tends to be guided; in practical applications, for example, obstacle avoidance should be guided rather than learned by trial and error. A state with a small potential field usually lacks information, for example, at a local minimum or around a moving target. Here, we should trust the reward-based critic more in policy evaluation to estimate the long-term return, and the action policy tends to explore. In addition, the potential field evaluation can be combined with planning to estimate a better state value function, so reward design can focus on the final-stage reward rather than reward shaping or phased rewards. Furthermore, the potential field evaluation can compensate for the lack of communication in multi-agent cooperation problems, i.e., each agent has its own reward-based critic and a relatively unified potential-field-based critic carrying prior information. Thirdly, simplified experiments on the predator-prey game demonstrate the effectiveness of the proposed approach.
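As a rough illustration of the idea described above (a minimal sketch, not the authors' implementation), the Python fragment below pairs a classic attractive/repulsive potential field with a reward-based advantage and weights the potential-field-based critic more heavily where the field magnitude, i.e., the prior information, is strong. All function names and parameters here (potential_field, blend_advantages, k_att, k_rep, d0, scale) are assumptions for illustration only.

import numpy as np

def potential_field(state, goal, obstacles, k_att=1.0, k_rep=1.0, d0=1.0):
    # Classic attractive/repulsive potential; a larger magnitude means a stronger prior.
    att = -k_att * np.linalg.norm(state - goal)            # attraction toward the goal
    rep = 0.0
    for obs in obstacles:
        d = np.linalg.norm(state - obs)
        if d < d0:                                          # repulsion only near obstacles
            rep -= k_rep * (1.0 / d - 1.0 / d0)
    return att + rep

def blend_advantages(adv_reward, adv_potential, potential_magnitude, scale=1.0):
    # Large potential magnitude -> trust the potential-field-based critic (guided action);
    # small magnitude (local minimum, near a moving target) -> trust the reward-based critic (explore).
    w = np.tanh(scale * potential_magnitude)                # confidence in the prior, in [0, 1)
    return w * adv_potential + (1.0 - w) * adv_reward

# In a policy-gradient step, the blended advantage would simply replace the usual
# reward-based advantage: grad = blended_advantage * grad_log_pi(action | state)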