Paper Title

When does return-conditioned supervised learning work for offline reinforcement learning?

Authors

David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche, Joan Bruna

Abstract

Several recent works have proposed a class of algorithms for the offline reinforcement learning (RL) problem that we will refer to as return-conditioned supervised learning (RCSL). RCSL algorithms learn the distribution of actions conditioned on both the state and the return of the trajectory. Then they define a policy by conditioning on achieving high return. In this paper, we provide a rigorous study of the capabilities and limitations of RCSL, something which is crucially missing in previous work. We find that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms. We provide specific examples of MDPs and datasets that illustrate the necessity of these assumptions and the limits of RCSL. Finally, we present empirical evidence that these limitations will also cause issues in practice by providing illustrative experiments in simple point-mass environments and on datasets from the D4RL benchmark.
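To make the RCSL idea described in the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation: it assumes a small discrete state/action space and a count-based (tabular) estimate of the action distribution conditioned on state and (binned) return-to-go, then acts by conditioning on a high target return. The function names (`fit_rcsl`, `rcsl_policy`) and the tabular setting are hypothetical choices made for illustration.

```python
import numpy as np

def fit_rcsl(trajectories, n_states, n_actions, n_return_bins, return_range):
    """Count-based estimate of P(action | state, binned return-to-go) from offline data.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples.
    return_range: (lo, hi) range used to bin the return-to-go values.
    """
    counts = np.zeros((n_states, n_return_bins, n_actions))
    lo, hi = return_range
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        for t, (s, a, _) in enumerate(traj):
            rtg = sum(rewards[t:])  # return-to-go from step t onward
            b = int(np.clip((rtg - lo) / (hi - lo) * n_return_bins,
                            0, n_return_bins - 1))
            counts[s, b, a] += 1
    # Normalize counts into conditional action distributions.
    probs = counts / np.maximum(counts.sum(axis=-1, keepdims=True), 1)
    return probs

def rcsl_policy(probs, state, target_return, n_return_bins, return_range):
    """Sample an action conditioned on the state and a high target return."""
    lo, hi = return_range
    b = int(np.clip((target_return - lo) / (hi - lo) * n_return_bins,
                    0, n_return_bins - 1))
    p = probs[state, b]
    if p.sum() == 0:  # (state, return) pair never seen in the data: fall back to uniform
        p = np.ones_like(p) / len(p)
    return np.random.choice(len(p), p=p)
```

In practice, RCSL methods replace the tabular counts above with a learned conditional model (e.g. a neural network trained by supervised learning), but the two steps are the same: fit the action distribution conditioned on state and return, then query it with a high target return at deployment time.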
