Paper Title


Provably Efficient Convergence of Primal-Dual Actor-Critic with Nonlinear Function Approximation

Paper Authors

Jing Dong, Li Shen, Yinggan Xu, Baoxiang Wang

Paper Abstract

We study the convergence of the actor-critic algorithm with nonlinear function approximation under a nonconvex-nonconcave primal-dual formulation. Stochastic gradient descent ascent is applied with an adaptive proximal term for robust learning rates. We show the first efficient convergence result for primal-dual actor-critic, with a convergence rate of $\mathcal{O}\left(\sqrt{\frac{\ln \left(N d G^2 \right)}{N}}\right)$ under Markovian sampling, where $G$ is the element-wise maximum of the gradient, $N$ is the number of iterations, and $d$ is the dimension of the gradient. Our result is presented with only the Polyak-Łojasiewicz condition for the dual variables, which is easy to verify and applicable to a wide range of reinforcement learning (RL) scenarios. The algorithm and analysis are general enough to be applied to other RL settings, such as multi-agent RL. Empirical results on OpenAI Gym continuous control tasks corroborate our theoretical findings.
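The abstract's core algorithmic idea is stochastic gradient descent ascent (SGDA) on the primal (actor) and dual (critic) variables, with an adaptive proximal term that keeps the effective per-coordinate learning rates robust. Below is a minimal, hypothetical sketch of that pattern on a toy nonconvex-nonconcave objective; the objective, the AdaGrad-style scaling, and all hyperparameters are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Minimal sketch of stochastic gradient descent ascent (SGDA) with an
# adaptive, AdaGrad-style proximal term. Illustrative only: the toy
# objective and hyperparameters below are assumptions, not the paper's
# actual actor-critic updates.

rng = np.random.default_rng(0)
d = 5                        # dimension of primal and dual variables
theta = rng.normal(size=d)   # primal variable (e.g. actor parameters)
lam = rng.normal(size=d)     # dual variable (e.g. critic parameters)

# Toy stochastic gradients of a nonconvex-nonconcave objective f(theta, lam);
# these stand in for the sampled policy/critic gradients.
def grad_theta(theta, lam, noise=0.01):
    return np.cos(theta) * lam + noise * rng.normal(size=d)

def grad_lam(theta, lam, noise=0.01):
    return np.sin(theta) - lam + noise * rng.normal(size=d)

eta = 0.1                    # base step size (assumed)
G_theta = np.zeros(d)        # accumulated squared gradients (primal)
G_lam = np.zeros(d)          # accumulated squared gradients (dual)

for t in range(1000):
    g_th = grad_theta(theta, lam)
    g_la = grad_lam(theta, lam)

    # Adaptive proximal term: per-coordinate accumulation of squared
    # gradients rescales each step, keeping effective learning rates robust.
    G_theta += g_th ** 2
    G_lam += g_la ** 2

    theta -= eta * g_th / np.sqrt(G_theta + 1e-8)   # descent on the primal
    lam += eta * g_la / np.sqrt(G_lam + 1e-8)       # ascent on the dual

print("final primal:", theta)
print("final dual:  ", lam)
```

The descent step on the primal variable and the ascent step on the dual variable alternate within each iteration, which is the basic structure the convergence analysis in the paper addresses.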
