Paper Title
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation
Paper Authors
Paper Abstract
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy that is most preferred by the human overseer. Despite empirical successes, the theoretical understanding of preference-based RL (PbRL) is limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and computes exploratory policies by solving an optimistic planning problem. Our algorithm achieves a regret of $\tilde{O}(\operatorname{poly}(dH)\sqrt{K})$, where $d$ is a complexity measure of the transition and preference models depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde{O}(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and we provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
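As a hedged illustration of the feedback setting described above (the abstract itself does not specify a particular preference model): a common instantiation links trajectory preferences to an underlying per-step reward through a Bradley-Terry (logistic) link. The symbols $r$, $\tau_1$, $\tau_2$, and $\sigma$ below are illustrative assumptions, not the paper's stated notation:

$$
\Pr\left(\tau_1 \succ \tau_2\right)
\;=\;
\sigma\!\left(\sum_{h=1}^{H} r\big(s_h^{1}, a_h^{1}\big) \;-\; \sum_{h=1}^{H} r\big(s_h^{2}, a_h^{2}\big)\right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}.
$$

Under a model of this form the agent never observes $r$ directly and must learn the most preferred policy from binary comparison outcomes alone; the $n$-wise extension mentioned in the abstract would replace pairwise comparisons with a choice among $n$ trajectories (for example, a Plackett-Luce style model), though the exact formulation is given in the paper itself.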