Paper Title
Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies
Paper Authors
Paper Abstract
This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, and total reward bounded by $1$, where the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds, which either have an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration properties of stationary policies, which may find applications in other problems related to Markov chains.
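To make the problem setting concrete, the following is a minimal, hypothetical Python sketch of the episodic interaction described in the abstract: a tabular MDP with $S$ states and $A$ actions, a planning horizon $H$, per-episode total reward bounded by $1$, and $K$ episodes played under a stationary (time-independent) policy. It only illustrates the setting; it is not the paper's algorithm, and the random MDP construction, function names, and normalization are assumptions made for illustration.

```python
import numpy as np

def random_tabular_mdp(S, A, rng):
    """Sample an illustrative tabular MDP: a transition kernel P[s, a, s']
    and per-step rewards R[s, a] in [0, 1]."""
    P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a] is a distribution over next states
    R = rng.uniform(size=(S, A))
    return P, R

def run_episode(P, R, policy, H, rng):
    """Play one length-H episode under a stationary policy (state -> action).
    Rewards are divided by H so the episode's total reward stays in [0, 1]."""
    S = P.shape[0]
    s = 0                                 # fixed initial state
    total = 0.0
    for _ in range(H):
        a = policy[s]
        total += R[s, a] / H              # keeps the total reward bounded by 1
        s = rng.choice(S, p=P[s, a])
    return total

if __name__ == "__main__":
    S, A, H, K = 5, 3, 100, 1000          # illustrative problem sizes
    rng = np.random.default_rng(0)
    P, R = random_tabular_mdp(S, A, rng)
    policy = rng.integers(A, size=S)      # an arbitrary stationary policy
    returns = [run_episode(P, R, policy, H, rng) for _ in range(K)]
    print(f"mean episode return over K={K} episodes: {np.mean(returns):.3f}")
```

The point of the sketch is that a stationary policy reuses the same state-to-action map at every step of the horizon, which is the object whose approximation power, stability, and concentration properties the paper's structural lemmas analyze.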