Paper Title
Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning
Paper Authors
Paper Abstract
Standard model-free reinforcement learning algorithms optimize a policy that generates the action to be taken at the current time step in order to maximize expected future return. While flexible, this approach suffers from inefficient exploration due to its single-step nature. In this work, we present the Generative Planning Method (GPM), which generates actions not only for the current step but also for a number of future steps (hence the term generative planning). This brings several benefits to GPM. First, since GPM is trained by maximizing value, the plans it generates can be regarded as intentional action sequences for reaching high-value regions. GPM can therefore leverage its multi-step plans for temporally coordinated exploration towards high-value regions, which is potentially more effective than a sequence of actions generated by perturbing each action independently at the single-step level, whose directional consistency decays exponentially with the number of exploration steps. Second, starting from a crude initial plan generator, GPM can refine it to be adaptive to the task, which in turn benefits future exploration. This is potentially more effective than the commonly used action-repeat strategy, which is non-adaptive in the form of its plans. Additionally, since the multi-step plan can be interpreted as the agent's intent over a span of time from the present into the future, it offers a more informative and intuitive signal for interpretation. Experiments are conducted on several benchmark environments, and the results demonstrate its effectiveness compared with several baseline methods.
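To make the core idea concrete, the sketch below shows a policy that emits an action plan for the current step plus several future steps and then executes that plan open-loop, so exploration stays temporally coordinated rather than being re-randomized at every step. This is a minimal illustration under assumed names (PlanPolicy, plan_horizon, rollout_with_plans) and a Gymnasium-style environment interface; it is not the authors' implementation and omits GPM's value-maximization training and plan refinement.

import torch
import torch.nn as nn

class PlanPolicy(nn.Module):
    # Maps a state to a multi-step action plan instead of a single action.
    def __init__(self, state_dim, action_dim, plan_horizon, hidden=256):
        super().__init__()
        self.action_dim = action_dim
        self.plan_horizon = plan_horizon
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            # One head emits actions for the current step and the next H-1 steps.
            nn.Linear(hidden, action_dim * plan_horizon), nn.Tanh(),
        )

    def forward(self, state):
        # Output shape: (batch, plan_horizon, action_dim)
        return self.net(state).view(-1, self.plan_horizon, self.action_dim)

def rollout_with_plans(env, policy, steps=1000):
    # Execute generated plans open-loop, re-planning only when the current
    # plan is exhausted, so consecutive exploratory actions stay coordinated.
    state, _ = env.reset()
    plan, plan_step = None, 0
    for _ in range(steps):
        if plan is None or plan_step >= policy.plan_horizon:
            with torch.no_grad():
                obs = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
                plan = policy(obs)[0]
            plan_step = 0
        action = plan[plan_step].numpy()
        state, reward, terminated, truncated, _ = env.step(action)
        plan_step += 1
        if terminated or truncated:
            state, _ = env.reset()
            plan, plan_step = None, 0

In practice one could also re-plan more frequently (e.g. every step, keeping the remaining plan as a prior), which trades off the temporal coordination of open-loop execution against reactivity to new observations.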