Paper Title
Preference-based Reinforcement Learning with Finite-Time Guarantees
Paper Authors
Paper Abstract
Preference-based Reinforcement Learning (PbRL) replaces the reward values of traditional reinforcement learning with preferences, which better elicit human opinion on the target objective, especially when numerical reward values are hard to design or interpret. Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy. In this paper, we present the first finite-time analysis for general PbRL problems. We first show that a unique optimal policy may not exist if preferences over trajectories are deterministic. If preferences are stochastic and the preference probability relates to the hidden reward values, we present algorithms for PbRL, both with and without a simulator, that can identify the best policy up to accuracy $\varepsilon$ with high probability. Our method explores the state space by navigating to under-explored states, and solves PbRL with a combination of dueling bandits and policy search. Experiments show the efficacy of our method when it is applied to real-world problems.
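To make the abstract's setup concrete, below is a minimal toy sketch (not the paper's algorithm) of the dueling-bandit flavor of PbRL: candidate policies are compared through stochastic trajectory preferences whose probabilities depend on hidden rewards, here via an assumed Bradley-Terry (logistic) link. The chain MDP, the candidate-policy set, and all names and parameters (`HORIZON`, `N_STATES`, `prefer_first`, etc.) are illustrative assumptions, not from the paper.

```python
# Illustrative sketch only: a toy dueling-bandit-style loop over candidate
# policies in a small chain MDP. A simulated annotator returns stochastic
# trajectory preferences through a Bradley-Terry link on hidden returns;
# the learner never observes the hidden rewards directly.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, HORIZON, N_POLICIES, N_DUELS = 5, 8, 6, 400
hidden_reward = rng.uniform(0, 1, N_STATES)                  # unknown to the learner
policies = rng.integers(0, 2, size=(N_POLICIES, N_STATES))   # action 0/1 per state

def rollout(policy):
    """Roll out a policy in the toy chain MDP; return the hidden (unobserved) return."""
    s, ret = 0, 0.0
    for _ in range(HORIZON):
        ret += hidden_reward[s]
        step = 1 if policy[s] == 1 else -1
        s = int(np.clip(s + step + rng.integers(-1, 2), 0, N_STATES - 1))
    return ret

def prefer_first(ret_a, ret_b):
    """Stochastic preference: P(A preferred over B) follows a logistic link on hidden returns."""
    p = 1.0 / (1.0 + np.exp(-(ret_a - ret_b)))
    return rng.random() < p

# Duel random pairs of candidate policies and track empirical win rates,
# the only feedback available being the binary preference outcomes.
wins = np.zeros(N_POLICIES)
counts = np.zeros(N_POLICIES)
for _ in range(N_DUELS):
    i, j = rng.choice(N_POLICIES, size=2, replace=False)
    ret_i, ret_j = rollout(policies[i]), rollout(policies[j])
    if prefer_first(ret_i, ret_j):
        wins[i] += 1
    else:
        wins[j] += 1
    counts[i] += 1
    counts[j] += 1

win_rate = wins / np.maximum(counts, 1)
print("Policy with highest empirical preference win rate:", int(np.argmax(win_rate)))
```

This sketch omits the paper's directed exploration of under-explored states and its policy-search component; it only illustrates how preference feedback, rather than numerical rewards, can drive the comparison of policies.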