Paper Title
A Finite Time Analysis of Two Time-Scale Actor Critic Methods
Paper Authors
Paper Abstract
Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods under the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$, with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
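To make the algorithmic structure described in the abstract concrete, below is a minimal Python sketch of a two time-scale actor-critic loop. Everything in it is an illustrative assumption rather than the paper's exact setup: the MDP is a small random tabular one, the critic is a tabular value estimate (the paper analyzes a critic with linear function approximation), and the step-size decay exponents are chosen only to exhibit the two time-scale property, where the critic's step size beta_t decays more slowly than the actor's alpha_t so the critic tracks the value function of the slowly changing policy.

import numpy as np

rng = np.random.default_rng(0)

# --- A small random MDP (hypothetical, for illustration only) ---
S, A, gamma = 5, 3, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(S, A))          # reward function

def softmax_policy(theta, s):
    """pi(. | s) under a tabular softmax parameterization (an assumption here)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros((S, A))  # actor parameters
w = np.zeros(S)           # critic parameters (tabular value estimates)

s = rng.integers(S)
T = 50_000
for t in range(T):
    # Two time-scale schedule: beta_t (critic) decays more slowly than
    # alpha_t (actor), so the critic moves on the faster time scale.
    # These exponents are illustrative, not the paper's exact schedule.
    alpha_t = 0.5 / (1 + t) ** 0.6
    beta_t = 0.5 / (1 + t) ** 0.4

    pi = softmax_policy(theta, s)
    a = rng.choice(A, p=pi)
    r = R[s, a]
    s_next = rng.choice(S, p=P[s, a])

    # Critic: TD(0) update of the value estimate at the visited state.
    td_error = r + gamma * w[s_next] - w[s]
    w[s] += beta_t * td_error

    # Actor: policy-gradient step using the TD error as an advantage
    # estimate; for softmax, grad log pi(a|s) = e_a - pi(.|s).
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_t * td_error * grad_log_pi

    s = s_next

print("greedy actions per state:", theta.argmax(axis=1))

The single sample per iteration (one transition drives both updates) mirrors the non-i.i.d., Markovian-sampling setting the abstract refers to: no i.i.d. resampling or replay buffer is used, and the two updates share the same trajectory.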