Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

Paper Title

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

Authors

Yuhao Ding, Javad Lavaei

Abstract

We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints, which plays a central role in ensuring the safety of RL in time-varying environments. In this problem, the reward/utility functions and the state transition functions are both allowed to vary arbitrarily over time, as long as their cumulative variations do not exceed certain known variation budgets. Designing safe RL algorithms in time-varying environments is particularly challenging because of the need to integrate constraint violation reduction, safe exploration, and adaptation to non-stationarity. To this end, we identify two alternative conditions on the time-varying constraints under which we can guarantee safety in the long run. We also propose the Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm, which can coordinate with both conditions. Furthermore, a dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under the two alternative conditions. This paper provides the first provably efficient algorithm for non-stationary CMDPs with safe exploration.
