Paper Title

When to Go, and When to Explore: The Benefit of Post-Exploration in Intrinsic Motivation

Authors

Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat

Abstract

Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper we present a systematic study of post-exploration, answering open questions that the Go-Explore paper left unanswered. First, we study the isolated potential of post-exploration by turning it on and off within the same algorithm. Subsequently, we introduce new methodology to adaptively decide when to post-explore and for how long to post-explore. Experiments on a range of MiniGrid environments show that post-exploration indeed boosts performance (with a bigger impact than tuning regular exploration parameters), and that this effect is further enhanced by adaptively deciding when and for how long to post-explore. In short, our work identifies adaptive post-exploration as a promising direction for RL exploration research.
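The 'Go' then 'Explore' idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm or their MiniGrid setup. It assumes a hypothetical 1-D chain of states, a simple archive of visited states, a frontier goal rule, and a fixed number of random post-exploration steps. The names `run_episode`, `coverage`, `post_steps`, and `epsilon` are all made up for illustration.

```python
import random


def run_episode(archive, n_states, post_steps, epsilon, rng):
    """One episode on a 1-D chain: 'Go' to a known goal, then optionally post-explore."""
    state = 0
    visited = {state}
    goal = max(archive)  # assumed goal rule: aim for the frontier of known states

    # 'Go' phase: walk toward the goal, taking a random step with probability epsilon.
    budget = 2 * n_states  # step limit so a noisy walk cannot run forever
    while state != goal and budget > 0:
        if rng.random() < epsilon:
            step = rng.choice((-1, 1))
        else:
            step = 1 if goal > state else -1
        state = min(max(state + step, 0), n_states - 1)
        visited.add(state)
        budget -= 1

    # 'Explore' phase (post-exploration): random steps taken only after the goal is reached.
    if state == goal:
        for _ in range(post_steps):
            state = min(max(state + rng.choice((-1, 1)), 0), n_states - 1)
            visited.add(state)
    return visited


def coverage(post_steps, episodes=50, n_states=30, epsilon=0.1, seed=0):
    """Number of distinct states discovered after a fixed number of episodes."""
    rng = random.Random(seed)
    archive = {0}
    for _ in range(episodes):
        archive |= run_episode(archive, n_states, post_steps, epsilon, rng)
    return len(archive)
```

Calling `coverage(0)` switches post-exploration off and `coverage(5)` switches it on. In this toy the 'Go' phase never moves past the current frontier, so all discovery comes from post-exploration. That makes the contrast far starker than in a real agent, and it says nothing about the size of the effect reported in the paper.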
