Paper Title

Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning

Authors

Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, Hongyao Tang

Abstract

In many real-world tasks, multiple agents must learn to coordinate with each other given their private observations and limited communication ability. Deep multiagent reinforcement learning (Deep-MARL) algorithms have shown superior performance in such challenging settings. One representative class of work is multiagent value decomposition, which decomposes the global shared multiagent Q-value $Q_{tot}$ into individual Q-values $Q^{i}$ to guide individuals' behaviors, e.g., VDN imposes an additive form and QMIX adopts a monotonic assumption with an implicit mixing method. However, most previous efforts impose certain assumptions on the relationship between $Q_{tot}$ and $Q^{i}$ and lack theoretical grounding. Moreover, they do not explicitly consider the agent-level impact of individuals on the whole system when transforming individual $Q^{i}$s into $Q_{tot}$. In this paper, we theoretically derive a general formula of $Q_{tot}$ in terms of $Q^{i}$, based on which we naturally implement a multi-head attention structure to approximate $Q_{tot}$, resulting in not only a refined representation of $Q_{tot}$ with an agent-level attention mechanism, but also a tractable maximization algorithm for decentralized policies. Extensive experiments demonstrate that our method outperforms state-of-the-art MARL methods on the widely adopted StarCraft benchmark across different scenarios, and a further attention analysis yields valuable insights.
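To make the contrast concrete, the decomposition structures mentioned above can be sketched as follows. This is a schematic summary based on the abstract, not the paper's exact derivation: $N$ denotes the number of agents, $\tau^{i}$ and $u^{i}$ agent $i$'s local observation history and action, and $s$ the global state; $H$, the state-dependent bias $c(s)$, and the per-head, per-agent attention weights $\lambda_{i,h}(s)$ are notation introduced here for illustration.

$$Q_{tot} = \sum_{i=1}^{N} Q^{i}(\tau^{i}, u^{i}) \qquad \text{(VDN: additive form)}$$

$$\frac{\partial Q_{tot}}{\partial Q^{i}} \geq 0, \ \forall i \qquad \text{(QMIX: monotonic mixing)}$$

$$Q_{tot} \approx c(s) + \sum_{h=1}^{H} \sum_{i=1}^{N} \lambda_{i,h}(s)\, Q^{i}(\tau^{i}, u^{i}) \qquad \text{(Qatten: agent-level multi-head attention weighting)}$$

Assuming the attention weights $\lambda_{i,h}(s)$ are non-negative (for instance, produced by a softmax over agents), each $Q^{i}$ enters the mixture monotonically, so greedily maximizing the individual $Q^{i}$s also maximizes the approximated $Q_{tot}$; this is what makes the decentralized maximization mentioned in the abstract tractable.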
