Title

Conservative Distributional Reinforcement Learning with Safety Constraints

Authors

Hengrui Zhang, Youfang Lin, Sheng Han, Shuo Wang, Kai Lv

Abstract

Safe exploration can be regarded as a constrained Markov decision problem in which the expected long-term cost is constrained. Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique. However, the cost function of these algorithms provides inaccurate estimates, which destabilizes learning of the Lagrange multiplier. In this paper, we present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization (CDMPO). First, to accurately judge whether the current situation satisfies the constraints, CDMPO adapts a distributional reinforcement learning method to estimate the Q-function and C-function. Then, CDMPO uses a conservative value-function loss to reduce the number of constraint violations during exploration. In addition, we utilize Weighted Average Proportional Integral Derivative (WAPID) to update the Lagrange multiplier stably. Empirical results show that the proposed method incurs fewer constraint violations during early exploration. The final test results also illustrate that our method achieves better risk control.
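For context, the constrained Markov decision problem and the Lagrangian relaxation mentioned in the abstract are usually written as follows, where r and c are the per-step reward and cost, γ is the discount factor, and d is the cost limit (generic notation, not necessarily the paper's):

```latex
% Constrained objective: maximize return subject to a bound on expected cost
\max_{\pi}\; J_R(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_C(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\Big] \le d

% Lagrangian relaxation: unconstrained dual problem over the multiplier \lambda
\min_{\lambda \ge 0}\; \max_{\pi}\; \mathcal{L}(\pi, \lambda) = J_R(\pi) - \lambda\,\big(J_C(\pi) - d\big)
```

The WAPID update named in the abstract is a weighted-average variant of PID control applied to the Lagrange multiplier. The sketch below shows only the generic PID idea, driving λ with the constraint violation J_C − d; it is not the paper's exact WAPID rule, and the gains and cost limit are made-up illustrative values:

```python
class PIDLagrangeMultiplier:
    """Generic PID-style controller for a Lagrange multiplier.

    Illustrative sketch only: the multiplier is driven by the proportional,
    integral, and derivative terms of the constraint violation J_C - d.
    This is not the paper's exact WAPID rule.
    """

    def __init__(self, kp=0.1, ki=0.01, kd=0.05, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd   # made-up gains
        self.cost_limit = cost_limit             # d: allowed expected cost
        self.integral = 0.0                      # accumulated violation
        self.prev_violation = 0.0                # for the derivative term
        self.lam = 0.0                           # current multiplier, kept >= 0

    def update(self, measured_cost):
        # Constraint violation: positive when the policy is unsafe.
        violation = measured_cost - self.cost_limit
        self.integral = max(0.0, self.integral + violation)
        derivative = max(0.0, violation - self.prev_violation)
        self.prev_violation = violation
        # The Lagrange multiplier must remain non-negative.
        self.lam = max(0.0, self.kp * violation
                            + self.ki * self.integral
                            + self.kd * derivative)
        return self.lam


# Example: the multiplier rises while measured costs exceed the limit,
# then relaxes once the policy satisfies the constraint.
pid = PIDLagrangeMultiplier(cost_limit=25.0)
for episode_cost in [40.0, 35.0, 30.0, 24.0, 20.0]:
    print(round(pid.update(episode_cost), 3))
```

Clipping the integral and derivative terms at zero, as above, is a common choice in PID-based Lagrangian methods: it keeps the multiplier from being pushed negative when the constraint is comfortably satisfied.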
