Paper Title
Multi-Agent Collaboration via Reward Attribution Decomposition
Paper Authors
Paper Abstract
Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders of magnitude more training rounds than humans and do not generalize to new agent configurations, even within the same game. In this work, we propose Collaborative Q-learning (CollaQ), which achieves state-of-the-art performance on the StarCraft Multi-Agent Challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization over reward assignment and show that each agent has an approximately optimal policy that decomposes into two parts: one that relies only on the agent's own state, and another that depends on the states of nearby agents. Following this novel finding, CollaQ decomposes the Q-function of each agent into a self term and an interactive term, with a Multi-Agent Reward Attribution (MARA) loss that regularizes training. CollaQ is evaluated on various StarCraft maps and outperforms existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN), improving the win rate by 40% with the same number of samples. In the more challenging ad hoc team play setting (i.e., reweighting, adding, or removing units without retraining or finetuning), CollaQ outperforms the previous SoTA by over 30%.
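The following is a minimal, illustrative PyTorch sketch (not the authors' implementation) of the per-agent decomposition described in the abstract: a self term that depends only on the agent's own observation, an interactive term that also takes nearby agents' observations, and a MARA-style regularizer that drives the interactive term toward zero when the other agents' inputs are masked out. All module names, observation sizes, the zero-masking scheme, and the loss weight are assumptions made purely for illustration.

# Illustrative sketch of the Q-function decomposition described above.
# Q_i ~= Q_alone(own obs) + Q_collab(own obs, nearby agents' obs),
# with a MARA-style penalty on Q_collab when other agents are masked out.
import torch
import torch.nn as nn


class DecomposedAgentQ(nn.Module):
    def __init__(self, self_obs_dim: int, others_obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Self term: depends only on the agent's own observation.
        self.q_alone = nn.Sequential(
            nn.Linear(self_obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )
        # Interactive term: depends on own + nearby agents' observations.
        self.q_collab = nn.Sequential(
            nn.Linear(self_obs_dim + others_obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, self_obs: torch.Tensor, others_obs: torch.Tensor):
        q_self = self.q_alone(self_obs)
        q_inter = self.q_collab(torch.cat([self_obs, others_obs], dim=-1))
        # MARA-style regularizer (illustrative): with nearby agents masked out
        # (zeros here), the interactive term should contribute nothing.
        q_inter_masked = self.q_collab(torch.cat([self_obs, torch.zeros_like(others_obs)], dim=-1))
        mara_loss = q_inter_masked.pow(2).mean()
        return q_self + q_inter, mara_loss


# Usage: combine the regularizer with a standard TD objective
# (the TD term and the weight 1.0 are placeholders).
if __name__ == "__main__":
    net = DecomposedAgentQ(self_obs_dim=10, others_obs_dim=20, n_actions=5)
    q_values, mara = net(torch.randn(4, 10), torch.randn(4, 20))
    td_loss = torch.zeros(())  # placeholder for the usual TD error
    loss = td_loss + 1.0 * mara
    loss.backward()

In actual training, this regularizer would be added to the TD loss of a value-factorization method (the abstract compares against QMIX, QTRAN, and VDN); the placeholder above only marks where that objective would go.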