Paper Title
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation
Paper Authors
Paper Abstract
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy that is most preferred by the human overseer. Despite empirical successes, the theoretical understanding of preference-based RL (PbRL) is limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and computes exploratory policies by solving an optimistic planning problem. Our algorithm achieves a regret of $\tilde{O}(\operatorname{poly}(dH)\sqrt{K})$, where $d$ is a complexity measure of the transition and preference models depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde{O}(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and we provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
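As a hedged illustration of the feedback setting described above (the abstract itself does not specify a particular preference model): a common instantiation links trajectory preferences to an underlying per-step reward through a Bradley-Terry (logistic) link. The symbols $r$, $\tau_1$, $\tau_2$, and $\sigma$ below are illustrative assumptions, not the paper's stated notation:

$$
\Pr\left(\tau_1 \succ \tau_2\right)
\;=\;
\sigma\!\left(\sum_{h=1}^{H} r\big(s_h^{1}, a_h^{1}\big) \;-\; \sum_{h=1}^{H} r\big(s_h^{2}, a_h^{2}\big)\right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}.
$$

Under a model of this form the agent never observes $r$ directly and must learn the most preferred policy from binary comparison outcomes alone; the $n$-wise extension mentioned in the abstract would replace pairwise comparisons with a choice among $n$ trajectories (for example, a Plackett-Luce style model), though the exact formulation is given in the paper itself.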