Paper Title

Policy Optimization for Stochastic Shortest Path

Paper Authors

Liyu Chen, Haipeng Luo, Aviv Rosenberg

Paper Abstract

Policy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees. In this work, we initiate the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications. We consider a wide range of settings, including stochastic and adversarial environments under full information or bandit feedback, and propose a policy optimization algorithm for each setting that makes use of novel correction terms and/or variants of dilated bonuses (Luo et al., 2021). For most settings, our algorithm is shown to achieve a near-optimal regret bound. One key technical contribution of this work is a new approximation scheme to tackle SSP problems that we call stacked discounted approximation and use in all our proposed algorithms. Unlike the finite-horizon approximation that is heavily used in recent SSP algorithms, our new approximation enables us to learn a near-stationary policy with only logarithmic changes during an episode and could lead to an exponential improvement in space complexity.
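
For context, the regret criterion referenced in the abstract is typically formalized as follows in the SSP literature; this is a standard sketch rather than notation quoted from the paper itself, and the symbols $s_{\text{init}}$ (initial state), $g$ (goal state), $c_k$ (episode-$k$ cost function), and $I_k$ (episode length) are assumed here for illustration:

\[
R_K \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I_k} c_k\!\left(s_i^k, a_i^k\right) \;-\; \min_{\pi \in \Pi_{\mathrm{proper}}} \sum_{k=1}^{K} V_k^{\pi}(s_{\text{init}}),
\]

where a proper policy is one that reaches the goal state $g$ with probability 1 from every state, and $V_k^{\pi}(s_{\text{init}})$ is the expected total cost of following $\pi$ from $s_{\text{init}}$ until the goal is reached under the cost function $c_k$. This form covers both the stochastic and adversarial cost settings mentioned above, and "near-optimal regret" refers to bounds on $R_K$ that match known lower bounds up to logarithmic factors.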
