Paper Title

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning

Authors

Jinchi Chen, Jie Feng, Weiguo Gao, Ke Wei

Abstract

This paper studies a policy optimization problem arising from collaborative multi-agent reinforcement learning in a decentralized setting where agents communicate with their neighbors over an undirected graph to maximize the sum of their cumulative rewards. A novel decentralized natural policy gradient method, dubbed Momentum-based Decentralized Natural Policy Gradient (MDNPG), is proposed, which incorporates natural gradient, momentum-based variance reduction, and gradient tracking into the decentralized stochastic gradient ascent framework. The $\mathcal{O}(n^{-1}\epsilon^{-3})$ sample complexity for MDNPG to converge to an $\epsilon$-stationary point has been established under standard assumptions, where $n$ is the number of agents. It indicates that MDNPG can achieve the optimal convergence rate for decentralized policy gradient methods and possesses a linear speedup in contrast to centralized optimization methods. Moreover, superior empirical performance of MDNPG over other state-of-the-art algorithms has been demonstrated by extensive numerical experiments.
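
For intuition, the following is a minimal numerical sketch (not the authors' implementation) of an MDNPG-style update that combines the three ingredients named in the abstract: momentum-based (STORM-style) variance reduction, gradient tracking over a doubly stochastic mixing matrix, and a natural-gradient preconditioner. The toy quadratic objective, the ring-graph mixing matrix W, the step size eta, the momentum parameter beta, and the identity Fisher approximation are all placeholder assumptions chosen only to keep the example self-contained and runnable.

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                    # number of agents, parameter dimension (placeholders)
T, eta, beta = 200, 0.05, 0.4  # iterations, step size, momentum parameter (placeholders)

# Doubly stochastic mixing matrix for a ring graph (placeholder topology).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

# Each agent's local objective is a toy concave quadratic; its noisy gradient
# stands in for a sampled local policy gradient.
targets = rng.normal(size=(n, d))

def stoch_grad(i, theta_i, noise):
    # stochastic ascent direction for agent i, evaluated with a given sample
    return -(theta_i - targets[i]) + noise

theta = np.zeros((n, d))  # local policy parameters
u = np.array([stoch_grad(i, theta[i], 0.1 * rng.normal(size=d)) for i in range(n)])
y = u.copy()              # gradient-tracking variables

for t in range(T):
    # Natural-gradient step: precondition the tracked gradient by an approximate
    # inverse Fisher matrix (identity used here as a placeholder), then mix with
    # neighbors (consensus) and ascend.
    fisher_inv = np.eye(d)
    new_theta = W @ theta + eta * (y @ fisher_inv)

    # Momentum-based variance reduction (STORM-style): reuse one sample per agent
    # at both the new and old parameters so the noise partially cancels.
    new_u = np.empty_like(u)
    for i in range(n):
        noise = 0.1 * rng.normal(size=d)
        g_new = stoch_grad(i, new_theta[i], noise)
        g_old = stoch_grad(i, theta[i], noise)
        new_u[i] = g_new + (1 - beta) * (u[i] - g_old)

    # Gradient tracking: each agent tracks the network-average gradient estimate.
    y = W @ y + new_u - u
    theta, u = new_theta, new_u

print("consensus error:", np.linalg.norm(theta - theta.mean(axis=0)))

Running the sketch shows the local parameters reaching consensus while ascending toward the average of the agents' optima, which is the qualitative behavior the combination of consensus mixing, gradient tracking, and variance-reduced updates is meant to deliver.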
