Title
Adapting the Exploration Rate for Value-of-Information-Based Reinforcement Learning
Authors
Abstract
In this paper, we consider the problem of adjusting the exploration rate when using value-of-information-based exploration. We do this by converting the value-of-information optimization into a problem of finding equilibria of a flow for a changing exploration rate. We then develop an efficient path-following scheme for converging to these equilibria and hence uncovering optimal action-selection policies. Under this scheme, the exploration rate is automatically adapted according to the agent's experiences. Global convergence is theoretically assured. We first evaluate our exploration-rate adaptation on the Nintendo GameBoy games Centipede and Millipede. We demonstrate aspects of the search process, such as how it yields a hierarchy of state abstractions. We also show that our approach returns better policies in fewer episodes than conventional search strategies that rely on heuristic, annealing-based exploration-rate adjustments. We then illustrate that these trends hold for deep, value-of-information-based agents that learn to play ten simple games and over forty more complicated games for the Nintendo GameBoy system. Performance near or well above the level of human play is observed.
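
To make the described mechanism concrete, the sketch below shows one way an adaptive exploration schedule of this kind can behave. It is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes the value-of-information policy reduces to a Boltzmann-like distribution over Q-values that reweights a prior over actions, and the names voi_policy, follow_path, beta, and beta_growth, as well as the geometric beta schedule standing in for the path-following scheme, are all hypothetical.

import numpy as np

# Minimal sketch (assumptions, not the authors' method): the policy is a
# Boltzmann-like reweighting of an action prior by exp(beta * Q), and the
# exploration parameter beta is swept along a path from small (near-uniform,
# exploratory) to large (near-greedy). The geometric schedule here is a
# hypothetical stand-in for the paper's equilibrium-tracking path follower.

def voi_policy(q_values, prior, beta):
    """Boltzmann-like policy: prior reweighted by exp(beta * Q)."""
    logits = beta * q_values + np.log(prior + 1e-12)
    logits -= logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def follow_path(q_values, prior, beta=0.1, beta_growth=1.05, steps=50):
    """Trace the policies obtained along a sequence of exploration rates.

    Each step slightly increases beta (reducing exploration) and recomputes
    the policy, so the sequence moves from a nearly uniform policy toward a
    greedy one.
    """
    policies = []
    for _ in range(steps):
        policies.append(voi_policy(q_values, prior, beta))
        beta *= beta_growth
    return policies

if __name__ == "__main__":
    q = np.array([1.0, 0.5, 0.2, 0.9])  # toy action values
    p = np.full(4, 0.25)                # uniform prior over four actions
    path = follow_path(q, p)
    print("early (exploratory):", np.round(path[0], 3))
    print("late (near-greedy): ", np.round(path[-1], 3))

Taking small steps in beta and re-solving at each step is the intuition behind path following: each policy on the path is a warm start for the next, which is what makes tracking the sequence of equilibria cheap compared with solving each exploration rate from scratch.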