Paper Title
Variational Policy Gradient Method for Reinforcement Learning with General Utilities
Paper Authors
Paper Abstract
In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov decision problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. Since dynamic programming is therefore no longer applicable, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the gradient of the parametrized policy objective may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, even though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves upon the available convergence rate.
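As a minimal sketch of the variational form alluded to above (our own illustration under standard assumptions, not the paper's exact theorem statement): let $\lambda(\theta)$ denote the discounted state-action occupancy measure induced by the parametrized policy $\pi_\theta$, let $R$ be a proper closed concave utility, and let $R_*(z) = \inf_{\lambda} \{\langle z, \lambda\rangle - R(\lambda)\}$ be its concave Fenchel conjugate. Then
\[
R(\lambda(\theta)) \;=\; \min_{z} \Bigl\{ \langle z, \lambda(\theta)\rangle - R_*(z) \Bigr\},
\qquad
\langle z, \lambda(\theta)\rangle \;=\; \mathbb{E}_{\pi_\theta}\Bigl[\textstyle\sum_{t \ge 0} \gamma^{t}\, z(s_t, a_t)\Bigr]
\]
(up to the normalization convention chosen for $\lambda$). The inner product is a standard cumulative-reward value function with ``reward'' $z$, so the gradient of the general utility with respect to $\theta$ can be recovered from an ordinary policy gradient evaluated at the minimizing dual variable $z$, which is the stochastic saddle point structure described in the abstract.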