When to Go, and When to Explore: The Benefit of Post-Exploration in Intrinsic Motivation

Paper title

When to Go, and When to Explore: The Benefit of Post-Exploration in Intrinsic Motivation

Authors

Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat

Abstract

Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper we present a systematic study of post-exploration, answering open questions that the Go-Explore paper did not answer yet. First, we study the isolated potential of post-exploration, by turning it on and off within the same algorithm. Subsequently, we introduce new methodology to adaptively decide when to post-explore and for how long to post-explore. Experiments on a range of MiniGrid environments show that post-exploration indeed boosts performance (with a bigger impact than tuning regular exploration parameters), and this effect is further enhanced by adaptively deciding when and for how long to post-explore. In short, our work identifies adaptive post-exploration as a promising direction for RL exploration research.
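
The 'Go' then 'Explore' structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D chain environment, the frontier goal rule, and the names `run`, `post_steps`, and `epsilon` are all assumptions made for illustration. It only shows what turning post-exploration on and off means within one loop.

```python
import random

def run(n_states=20, episodes=50, post_steps=5, epsilon=0.0, seed=0):
    """Return the set of states visited on a 1-D chain, starting at state 0.

    Hypothetical toy setup: post_steps=0 switches post-exploration off.
    """
    rng = random.Random(seed)
    archive = {0}  # states reached so far; goals are chosen from here

    def step(state, action):
        # Actions are -1 or +1; walls clamp the agent inside the chain.
        return min(max(state + action, 0), n_states - 1)

    for _ in range(episodes):
        goal = max(archive)  # assumed goal rule: return to the frontier state
        state = 0
        # 'Go' phase: walk toward the goal, with optional epsilon-random moves.
        for _ in range(2 * n_states):
            if state == goal:
                break
            greedy = 1 if goal > state else -1
            action = rng.choice((-1, 1)) if rng.random() < epsilon else greedy
            state = step(state, action)
            archive.add(state)
        # 'Explore' phase (post-exploration): runs only once the goal is reached.
        if state == goal:
            for _ in range(post_steps):
                state = step(state, rng.choice((-1, 1)))
                archive.add(state)
    return archive

if __name__ == "__main__":
    without_post = run(post_steps=0)
    with_post = run(post_steps=5)
    print(len(without_post), len(with_post))
```

With `epsilon=0.0` and `post_steps=0`, the agent in this sketch never leaves the start state, so the archive stays at `{0}`. Any coverage beyond that comes from the post-exploration steps. Raising `epsilon` gives the regular exploration baseline to compare against.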
