Paper Title

Logarithmic Regret for Reinforcement Learning with Linear Function Approximation

Paper Authors

Jiafan He, Dongruo Zhou, Quanquan Gu

Paper Abstract

Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining $\sqrt{T}$-type regret bounds, where $T$ is the number of interactions with the MDP. In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions, provided that there exists a positive sub-optimality gap for the optimal action-value function. More specifically, under the linear MDP assumption (Jin et al., 2019), the LSVI-UCB algorithm can achieve $\tilde{O}(d^{3}H^5/\text{gap}_{\text{min}}\cdot \log(T))$ regret; and under the linear mixture MDP assumption (Ayoub et al., 2020), the UCRL-VTR algorithm can achieve $\tilde{O}(d^{2}H^5/\text{gap}_{\text{min}}\cdot \log^3(T))$ regret, where $d$ is the dimension of the feature mapping, $H$ is the length of an episode, $\text{gap}_{\text{min}}$ is the minimal sub-optimality gap, and $\tilde O$ hides all logarithmic terms except $\log(T)$. To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models.
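For reference, the sub-optimality gap appearing in the bounds above is commonly defined as follows (a standard formulation in gap-dependent RL analysis; the paper's exact notation may differ slightly):

$$\text{gap}_h(s,a) = V_h^*(s) - Q_h^*(s,a), \qquad \text{gap}_{\text{min}} = \min_{h,s,a}\bigl\{\text{gap}_h(s,a) : \text{gap}_h(s,a) > 0\bigr\},$$

where $V_h^*$ and $Q_h^*$ denote the optimal value function and optimal action-value function at step $h$ of an episode. The logarithmic regret bounds require $\text{gap}_{\text{min}} > 0$, i.e., every sub-optimal action is separated from the optimal one by a positive margin.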
