Paper Title

Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Authors

Zihan Zhang, Xiangyang Ji, Simon S. Du

Abstract

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and an agent that plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration properties of stationary policies, which can have applications in other problems related to Markov chains.
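
To make the title's theme concrete, below is a minimal, hypothetical sketch (not the paper's algorithm) contrasting a stationary policy, which maps states to actions independently of the step index, with a non-stationary policy, whose action table grows with the horizon $H$. All sizes and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# A minimal sketch, not the paper's algorithm: it contrasts a
# non-stationary policy (a different state-to-action table at every
# step h < H) with a stationary policy (one table reused at every
# step). All sizes below are illustrative.

S, A, H = 4, 3, 100          # states, actions, planning horizon
rng = np.random.default_rng(0)

# Non-stationary policy: H separate state-to-action tables (H x S entries).
pi_nonstationary = rng.integers(A, size=(H, S))

# Stationary policy: a single state-to-action table (S entries), so its
# description length does not grow with H -- one reason horizon-free
# guarantees become plausible.
pi_stationary = rng.integers(A, size=S)

def act(policy, s, h):
    """Action taken in state s at step h under either kind of policy."""
    return policy[h, s] if policy.ndim == 2 else policy[s]

print(pi_nonstationary.size, "vs", pi_stationary.size)   # 400 vs 4
print(act(pi_nonstationary, s=2, h=57), act(pi_stationary, s=2, h=57))
```

The design point the sketch highlights is that the stationary policy's size is independent of $H$, mirroring the paper's use of stationary policies to remove the horizon dependence from the regret bound.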
