Paper Title

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

Authors

Orin Levy, Yishay Mansour

Abstract

We present regret minimization algorithms for stochastic contextual MDPs (CMDPs) under a minimum reachability assumption, using access to an offline least-squares regression oracle. We analyze three settings: where the dynamics are known; where the dynamics are unknown but independent of the context; and the most challenging setting, where the dynamics are unknown and context-dependent. For the last setting, our algorithm obtains a regret bound of $\widetilde{O}((H+1/p_{\min})H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{G}|,|\mathcal{P}|\}/\delta)})$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ is the set of actions, $H$ is the horizon, and $T$ is the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge of the function class, such as linearity). We present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{G}|)/\ln(|A|)})$ on the expected regret, which holds even when the dynamics are known. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, which obtains $\widetilde{O}(T^{3/4})$ regret.
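
To get a feel for how the stated rates scale, the following is a minimal sketch that evaluates only the leading terms of the bounds quoted in the abstract. It is not the authors' algorithm. The function and argument names are hypothetical. Constants and the polylogarithmic factors hidden by $\widetilde{O}$ are dropped, so the outputs are useful for comparing growth rates, not as actual regret values.

```python
import math


def upper_bound(H, S, A, T, G, P, p_min, delta):
    """Leading term of the high-probability regret bound for unknown,
    context-dependent dynamics. S, A, G, P are the sizes |S|, |A|, |G|, |P|.
    Assumption: constants and hidden polylog factors are set to 1."""
    log_term = math.log(max(G, P) / delta)
    return (H + 1.0 / p_min) * H * S ** 1.5 * math.sqrt(A * T * log_term)


def lower_bound(H, S, A, T, G):
    """Leading term of the expected-regret lower bound (known dynamics).
    Requires A >= 2 so that ln(A) > 0."""
    return math.sqrt(T * H * S * A * math.log(G) / math.log(A))


def no_reachability_rate(T):
    """T-dependence of the extension without minimum reachability."""
    return T ** 0.75


if __name__ == "__main__":
    # Illustrative (made-up) problem sizes.
    params = dict(H=5, S=10, A=4, G=1000, P=1000, p_min=0.1, delta=0.05)
    for T in (10**3, 10**4, 10**5):
        print(T, upper_bound(T=T, **params), no_reachability_rate(T))
```

Both the upper and lower bounds grow as $\sqrt{T}$, while the extension without minimum reachability grows as $T^{3/4}$, so quadrupling $T$ doubles the first two and multiplies the third by $4^{3/4} \approx 2.83$.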
