Learning to Play Imperfect-Information Games by Imitating an Oracle Planner

Paper title

Learning to Play Imperfect-Information Games by Imitating an Oracle Planner

Authors

Rinu Boney, Alexander Ilin, Juho Kannala, Jarno Seppänen

Abstract

We consider learning to play multiplayer imperfect-information games with simultaneous moves and large state-action spaces. Previous attempts to tackle such challenging games have largely focused on model-free learning methods, often requiring hundreds of years of experience to produce competitive agents. Our approach is based on model-based planning. We tackle the problem of partial observability by first building an (oracle) planner that has access to the full state of the environment and then distilling the knowledge of the oracle to a (follower) agent which is trained to play the imperfect-information game by imitating the oracle's choices. We experimentally show that planning with naive Monte Carlo tree search does not perform very well in large combinatorial action spaces. We therefore propose planning with a fixed-depth tree search and decoupled Thompson sampling for action selection. We show that the planner is able to discover efficient playing strategies in the games of Clash Royale and Pommerman and the follower policy successfully learns to implement them by training on a few hundred battles.
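As a rough illustration of the decoupled Thompson sampling idea mentioned in the abstract, the sketch below applies it to a one-shot, two-player simultaneous-move game. It is not the authors' planner: the toy payoff matrix, the Beta-Bernoulli posterior, and the function name `decoupled_thompson` are all assumptions made for this example, and the fixed-depth tree search and the imitation step are omitted. The point it shows is that each player keeps statistics only over its own actions and samples independently, so the bookkeeping grows with the sum of the players' action counts rather than their product.

```python
import random


def decoupled_thompson(payoff, n_iters=2000, seed=0):
    """Toy decoupled Thompson sampling for a two-player simultaneous-move game.

    payoff[i][j] is the (hypothetical) probability that player 0 wins when
    player 0 plays action i and player 1 plays action j.
    Returns how often each player selected each of its own actions.
    """
    rng = random.Random(seed)
    n_actions = [len(payoff), len(payoff[0])]
    # Assumed Beta(1, 1) prior per player and per own action: [wins + 1, losses + 1].
    stats = [[[1, 1] for _ in range(n)] for n in n_actions]
    counts = [[0] * n for n in n_actions]

    for _ in range(n_iters):
        # Each player samples from its own posteriors, ignoring the opponent's choice.
        choice = []
        for p in range(2):
            samples = [rng.betavariate(a, b) for a, b in stats[p]]
            choice.append(max(range(len(samples)), key=samples.__getitem__))
        i, j = choice

        # Simulate one outcome of the joint action (zero-sum: exactly one winner).
        win0 = rng.random() < payoff[i][j]
        stats[0][i][0 if win0 else 1] += 1  # player 0: win or loss for action i
        stats[1][j][1 if win0 else 0] += 1  # player 1: loss or win for action j
        counts[0][i] += 1
        counts[1][j] += 1

    return counts


if __name__ == "__main__":
    # Row 0 is better for player 0; column 1 is better for player 1.
    print(decoupled_thompson([[0.9, 0.6], [0.3, 0.1]]))
```

In this toy game each player's selections should concentrate on its better action as the posteriors sharpen. A planner along the lines the abstract describes would presumably replace the single simulated outcome with the value returned by a search from the resulting state.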
