Paper Title
Sample-efficient Iterative Lower Bound Optimization of Deep Reactive Policies for Planning in Continuous MDPs
Paper Authors
Paper Abstract
Recent advances in deep learning have enabled optimization of deep reactive policies (DRPs) for continuous MDP planning by encoding a parametric policy as a deep neural network and exploiting automatic differentiation in an end-to-end model-based gradient descent framework. This approach has proven effective for optimizing DRPs in nonlinear continuous MDPs, but it requires a large number of sampled trajectories to learn effectively and can suffer from high variance in solution quality. In this work, we revisit the overall model-based DRP objective and instead take a minorization-maximization perspective to iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective. This novel formulation of DRP learning as iterative lower bound optimization (ILBO) is particularly appealing because (i) each step is structurally easier to optimize than the overall objective, (ii) it guarantees a monotonically improving objective under certain theoretical conditions, and (iii) it reuses samples between iterations, thus lowering sample complexity. Empirical evaluation confirms that ILBO is significantly more sample-efficient than the state-of-the-art DRP planner and consistently produces better solution quality with lower variance. We additionally demonstrate that ILBO generalizes well to new problem instances (i.e., different initial states) without requiring retraining.
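To illustrate the minorization-maximization principle that underlies the monotonic-improvement claim in point (ii), the following is a generic sketch; the objective $J$ and surrogate $g$ are illustrative symbols standing in for the abstract's DRP objective and its locally tight lower bound, not the paper's exact construction. Suppose the surrogate built at iterate $\theta_t$ satisfies

\[ g(\theta \mid \theta_t) \le J(\theta) \quad \text{for all } \theta, \qquad g(\theta_t \mid \theta_t) = J(\theta_t). \]

Then the update $\theta_{t+1} = \arg\max_{\theta} g(\theta \mid \theta_t)$ gives

\[ J(\theta_{t+1}) \ \ge\ g(\theta_{t+1} \mid \theta_t) \ \ge\ g(\theta_t \mid \theta_t) \ =\ J(\theta_t), \]

so the true objective is non-decreasing across iterations; since the surrogate is anchored at the current iterate, samples used to form it can be reused while $\theta$ is optimized within that iteration.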