Paper Title
POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis
Paper Authors
Paper Abstract
Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider Monte-Carlo planning in an environment with continuous state-action spaces, a much less understood problem with important applications in control and robotics. We introduce POLY-HOOT, an algorithm that augments MCTS with a continuous-armed bandit strategy named Hierarchical Optimistic Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using an appropriate polynomial, rather than logarithmic, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant role in achieving theoretical guarantees of finite-space MCTS (Shah et al., 2019). We investigate, for the first time, the regret of the enhanced HOO algorithm in non-stationary bandit problems. Using this result as a building block, we establish non-asymptotic convergence guarantees for POLY-HOOT: the value estimate converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate. We further provide experimental results that corroborate our theoretical findings.
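To illustrate the distinction the abstract draws between logarithmic and polynomial exploration bonuses, here is a minimal sketch in Python. The function names and the particular exponents are illustrative assumptions, not the paper's exact constants; the point is only that a bonus polynomial in the total visit count eventually dominates a logarithmic one, sustaining exploration in non-stationary settings such as the inner nodes of an MCTS tree.

```python
import math

def log_bonus(n, n_i, c=2.0):
    # Classical UCB1-style bonus: logarithmic in the total
    # number of pulls n, scaled by the arm's pull count n_i.
    return math.sqrt(c * math.log(n) / n_i)

def poly_bonus(n, n_i, alpha=0.25, beta=0.5):
    # Polynomial bonus: grows as a power of n rather than log n.
    # The exponents alpha and beta here are illustrative choices,
    # not the constants analyzed in the paper.
    return n ** alpha / n_i ** beta

# For large n, the polynomial bonus exceeds the logarithmic one,
# i.e., it keeps pushing exploration longer.
n, n_i = 10_000, 10
print(poly_bonus(n, n_i) > log_bonus(n, n_i))  # True for these values
```

The design intuition matching the abstract: in non-stationary bandit problems (as arise at internal MCTS nodes, whose reward distributions drift as the subtree estimates improve), a logarithmic bonus can under-explore, whereas a polynomial bonus admits the non-asymptotic concentration arguments used in the analysis.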