Paper Title
Variance-Reduced Conservative Policy Iteration
Paper Authors
Paper Abstract
We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reduction-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and are therefore unaffected by possibly non-linear or discontinuous parameterizations of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
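For intuition, the following is a minimal sketch of the CPI-style update loop the abstract builds on, assuming tabular policies stored as |S| x |A| NumPy arrays. The `erm_oracle` argument is a hypothetical stand-in for the per-iteration empirical risk minimization step; the paper's variance-reduced advantage estimator is not reproduced here.

```python
import numpy as np


def cpi_step(pi: np.ndarray, pi_erm: np.ndarray, alpha: float) -> np.ndarray:
    """Conservative mixture update: pi' = (1 - alpha) * pi + alpha * pi_erm.

    Mixing with a small alpha keeps the state-visitation distribution of
    pi' close to that of pi, which is what drives the functional (rather
    than parameter-space) local convergence guarantees.
    """
    return (1.0 - alpha) * pi + alpha * pi_erm


def run_cpi(pi0: np.ndarray, erm_oracle, alphas) -> np.ndarray:
    """Iterate conservative updates against a supplied ERM oracle.

    erm_oracle(pi) is a hypothetical placeholder: it should return the
    policy in the class that approximately maximizes the empirical
    advantage against pi, estimated from sampled trajectories. A
    variance-reduced variant would reuse estimates across iterations
    to lower the per-iteration sample cost.
    """
    pi = pi0
    for alpha in alphas:
        pi = cpi_step(pi, erm_oracle(pi), alpha)
    return pi
```

This sketch only illustrates the conservative mixture step; the sample-complexity improvements claimed in the abstract come from how the ERM subproblems are solved, not from the update rule itself.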