Paper Title

Logarithmic Regret for Reinforcement Learning with Linear Function Approximation

Paper Authors

Jiafan He, Dongruo Zhou, Quanquan Gu

Paper Abstract

Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining $\sqrt{T}$-type regret bounds, where $T$ is the number of interactions with the MDP. In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions, provided that there exists a positive sub-optimality gap for the optimal action-value function. More specifically, under the linear MDP assumption (Jin et al., 2019), the LSVI-UCB algorithm can achieve $\tilde{O}(d^{3}H^5/\text{gap}_{\text{min}}\cdot \log(T))$ regret; and under the linear mixture MDP assumption (Ayoub et al., 2020), the UCRL-VTR algorithm can achieve $\tilde{O}(d^{2}H^5/\text{gap}_{\text{min}}\cdot \log^3(T))$ regret, where $d$ is the dimension of the feature mapping, $H$ is the length of an episode, $\text{gap}_{\text{min}}$ is the minimal sub-optimality gap, and $\tilde O$ hides all logarithmic terms except $\log(T)$. To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models.
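For reference, the sub-optimality gap appearing in the bounds above is commonly defined as follows (a standard formulation in gap-dependent RL analysis; the paper's exact notation may differ slightly):

$$\text{gap}_h(s,a) = V_h^*(s) - Q_h^*(s,a), \qquad \text{gap}_{\text{min}} = \min_{h,s,a}\bigl\{\text{gap}_h(s,a) : \text{gap}_h(s,a) > 0\bigr\},$$

where $V_h^*$ and $Q_h^*$ denote the optimal value function and optimal action-value function at step $h$ of an episode. The logarithmic regret bounds require $\text{gap}_{\text{min}} > 0$, i.e., every sub-optimal action is separated from the optimal one by a positive margin.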
