Paper Title
Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning
Paper Authors
Paper Abstract
Value function decomposition is becoming a popular rule of thumb for scaling up multi-agent reinforcement learning (MARL) in cooperative games. For such a decomposition rule to hold, the assumption of the individual-global-max (IGM) principle must be made; that is, the local maxima of each agent's decomposed value function must jointly correspond to the global maximum of the joint value function. This principle, however, does not hold in general. As a result, the applicability of value decomposition algorithms remains unclear and their convergence properties unknown. In this paper, we make the first effort to answer these questions. Specifically, we introduce the set of cooperative games in which value decomposition methods find their validity, which we refer to as decomposable games. In decomposable games, we theoretically prove that applying the multi-agent fitted Q-iteration algorithm (MA-FQI) leads to the optimal Q-function. In non-decomposable games, the Q-function estimated by MA-FQI can still converge to the optimum, even though the Q-function must be projected onto the decomposable function space at each iteration. In both settings, we consider value function representations by practical deep neural networks and derive the corresponding convergence rates. In summary, our results offer, for the first time, theoretical insights for MARL practitioners into when value decomposition algorithms converge and why they perform well.
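For reference, the IGM principle invoked above is commonly stated as follows (the notation, with a global state s, joint action a, and per-agent utilities Q_i, is assumed here since the abstract does not fix it):

\[
\arg\max_{\mathbf{a}} Q_{\mathrm{jt}}(s, \mathbf{a})
  = \Bigl( \arg\max_{a_1} Q_1(s, a_1),\ \ldots,\ \arg\max_{a_n} Q_n(s, a_n) \Bigr),
\]

i.e., decentralized per-agent greedy action selection must recover the greedy joint action of the joint value function Q_jt.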
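To make the projection step concrete, below is a minimal NumPy sketch, not the paper's algorithm or code: the payoff matrices, the purely additive decomposition q1(a1) + q2(a2), and all function names are illustrative assumptions. It projects a one-shot two-agent payoff matrix onto additive per-agent utilities by least squares (the analogue of projecting onto the decomposable function space) and compares decentralized greedy action selection against the true joint optimum.

import numpy as np

def project_additive(Q):
    # Least-squares projection of a joint payoff matrix Q onto the space of
    # additive decompositions q1(a1) + q2(a2) (the "decomposable" function space).
    n1, n2 = Q.shape
    A = np.zeros((n1 * n2, n1 + n2))
    for i in range(n1):
        for j in range(n2):
            A[i * n2 + j, i] = 1.0        # weight on q1(i)
            A[i * n2 + j, n1 + j] = 1.0   # weight on q2(j)
    theta, *_ = np.linalg.lstsq(A, Q.reshape(-1), rcond=None)
    return theta[:n1], theta[n1:]

# Decomposable game: the payoff is exactly q1(a1) + q2(a2), so IGM holds and
# per-agent greedy actions recover the joint optimum.
decomposable = np.array([[0.0, 1.0],
                         [2.0, 3.0]])

# Non-decomposable game (a coordination-style payoff): the additive projection
# misranks the joint actions, so decentralized greedy selection is suboptimal.
non_decomposable = np.array([[  8.0, -12.0, -12.0],
                             [-12.0,   0.0,   0.0],
                             [-12.0,   0.0,   0.0]])

for name, Q in [("decomposable", decomposable), ("non-decomposable", non_decomposable)]:
    q1, q2 = project_additive(Q)
    a1, a2 = int(np.argmax(q1)), int(np.argmax(q2))   # decentralized greedy actions
    print(f"{name}: greedy joint action ({a1}, {a2}) with payoff {Q[a1, a2]:.1f}, "
          f"optimal payoff {Q.max():.1f}")

In the decomposable game the per-agent greedy actions recover the joint optimum, whereas in the non-decomposable game a single projection misranks the joint actions; this is precisely the failure mode that motivates studying when and how MA-FQI with repeated projection still converges.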