Paper Title
A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes
Paper Authors
Paper Abstract
In the reinforcement learning literature, there are many algorithms developed for either Contextual Bandit (CB) or Markov Decision Process (MDP) environments. However, when deploying reinforcement learning algorithms in the real world, even with domain expertise, it is often difficult to know whether it is appropriate to treat a sequential decision-making problem as a CB or an MDP. In other words, do actions affect future states, or only the immediate rewards? Making the wrong assumption about the nature of the environment can lead to inefficient learning, or even prevent the algorithm from ever learning an optimal policy, even with infinite data. In this work, we develop an online algorithm that uses a Bayesian hypothesis testing approach to learn the nature of the environment. Our algorithm allows practitioners to incorporate prior knowledge about whether the environment is a CB or an MDP, and it effectively interpolates between classical CB and MDP-based algorithms to mitigate the effects of misspecifying the environment. We perform simulations and demonstrate that in CB settings our algorithm achieves lower regret than MDP-based algorithms, while in non-bandit MDP settings our algorithm is able to learn the optimal policy, often achieving regret comparable to that of MDP-based algorithms.
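The abstract does not spell out the hypothesis test itself, but the core question it poses, whether actions influence the next state or only the immediate reward, can be illustrated with a simple Bayesian model comparison. The sketch below is a hypothetical illustration under assumptions of our own, not the paper's algorithm: it compares the marginal likelihood of observed transition counts under an action-independent ("bandit") transition model against an action-dependent (MDP) model, using symmetric Dirichlet priors. The function names, the alpha hyperparameter, and the prior_bandit argument (the practitioner's prior belief that the environment is a CB, echoing the prior knowledge mentioned in the abstract) are illustrative assumptions.

import numpy as np
from scipy.special import gammaln


def log_dirichlet_multinomial(counts, alpha):
    # Log marginal likelihood of categorical count data under a symmetric
    # Dirichlet(alpha) prior on the category probabilities.
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (gammaln(k * alpha) - gammaln(k * alpha + counts.sum())
            + gammaln(counts + alpha).sum() - k * gammaln(alpha))


def posterior_prob_bandit(transition_counts, prior_bandit=0.5, alpha=1.0):
    # Posterior probability that the environment behaves like a contextual
    # bandit, i.e. that the next-state distribution does not depend on the
    # chosen action. transition_counts has shape (S, A, S), where entry
    # N[s, a, s'] counts observed transitions from state s under action a.
    counts = np.asarray(transition_counts, dtype=float)
    S, A, _ = counts.shape
    # H_bandit: one next-state distribution per state, counts pooled over actions.
    log_ml_bandit = sum(log_dirichlet_multinomial(counts[s].sum(axis=0), alpha)
                        for s in range(S))
    # H_mdp: a separate next-state distribution for every (state, action) pair.
    log_ml_mdp = sum(log_dirichlet_multinomial(counts[s, a], alpha)
                     for s in range(S) for a in range(A))
    log_odds = (np.log(prior_bandit) - np.log(1.0 - prior_bandit)
                + log_ml_bandit - log_ml_mdp)
    return 1.0 / (1.0 + np.exp(-log_odds))


# Toy usage example with arbitrary transition counts for 2 states and 2 actions.
rng = np.random.default_rng(0)
counts = rng.integers(0, 20, size=(2, 2, 2))
print(posterior_prob_bandit(counts, prior_bandit=0.5))

In the spirit of the abstract, a posterior of this kind could be used online to weight, or interpolate between, a CB-style and an MDP-based learning update; how the paper actually performs the test and the interpolation is specified in the full text, not in this sketch.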