Paper Title
Multi-Agent Collaboration via Reward Attribution Decomposition
Paper Authors
Paper Abstract
Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders of magnitude more training rounds than humans and do not generalize to new agent configurations, even within the same game. In this work, we propose Collaborative Q-learning (CollaQ), which achieves state-of-the-art performance on the StarCraft Multi-Agent Challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization over reward assignment and show that each agent has an approximately optimal policy that decomposes into two parts: one that relies only on the agent's own state, and another that depends on the states of nearby agents. Following this novel finding, CollaQ decomposes the Q-function of each agent into a self term and an interactive term, with a Multi-Agent Reward Attribution (MARA) loss that regularizes training. CollaQ is evaluated on various StarCraft maps and outperforms existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN), improving the win rate by 40% with the same number of samples. In the more challenging ad hoc team play setting (i.e., reweighting, adding, or removing units without retraining or finetuning), CollaQ outperforms the previous SoTA by over 30%.
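The following is a minimal, illustrative PyTorch sketch (not the authors' implementation) of the per-agent decomposition described in the abstract: a self term that depends only on the agent's own observation, an interactive term that also takes nearby agents' observations, and a MARA-style regularizer that drives the interactive term toward zero when the other agents' inputs are masked out. All module names, observation sizes, the zero-masking scheme, and the loss weight are assumptions made purely for illustration.

# Illustrative sketch of the Q-function decomposition described above.
# Q_i ~= Q_alone(own obs) + Q_collab(own obs, nearby agents' obs),
# with a MARA-style penalty on Q_collab when other agents are masked out.
import torch
import torch.nn as nn


class DecomposedAgentQ(nn.Module):
    def __init__(self, self_obs_dim: int, others_obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Self term: depends only on the agent's own observation.
        self.q_alone = nn.Sequential(
            nn.Linear(self_obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )
        # Interactive term: depends on own + nearby agents' observations.
        self.q_collab = nn.Sequential(
            nn.Linear(self_obs_dim + others_obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, self_obs: torch.Tensor, others_obs: torch.Tensor):
        q_self = self.q_alone(self_obs)
        q_inter = self.q_collab(torch.cat([self_obs, others_obs], dim=-1))
        # MARA-style regularizer (illustrative): with nearby agents masked out
        # (zeros here), the interactive term should contribute nothing.
        q_inter_masked = self.q_collab(torch.cat([self_obs, torch.zeros_like(others_obs)], dim=-1))
        mara_loss = q_inter_masked.pow(2).mean()
        return q_self + q_inter, mara_loss


# Usage: combine the regularizer with a standard TD objective
# (the TD term and the weight 1.0 are placeholders).
if __name__ == "__main__":
    net = DecomposedAgentQ(self_obs_dim=10, others_obs_dim=20, n_actions=5)
    q_values, mara = net(torch.randn(4, 10), torch.randn(4, 20))
    td_loss = torch.zeros(())  # placeholder for the usual TD error
    loss = td_loss + 1.0 * mara
    loss.backward()

In actual training, this regularizer would be added to the TD loss of a value-factorization method (the abstract compares against QMIX, QTRAN, and VDN); the placeholder above only marks where that objective would go.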