Paper Title
COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation
Paper Authors
Paper Abstract
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios, where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that guarantees satisfying the cost constraints in the offline RL setting, since the off-policy evaluation inherently has an estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction. Experimental results show that COptiDICE attains better policies in terms of constraint satisfaction and return-maximization, outperforming baseline algorithms.
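As a rough illustration of the formulation the abstract refers to, constrained RL "in the space of the stationary distribution" can be sketched as a constrained optimization over the state-action occupancy d(s,a). The notation below (d, d^D, w, c_k, \hat{c}_k, p_0, gamma) is introduced here for illustration only and need not match the paper's exact formulation.

\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% Sketch: constrained RL posed directly over stationary distributions d(s,a),
% rather than over policies.
\begin{align*}
\max_{d \ge 0}\;\; & \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big] \\
\text{s.t.}\;\; & \mathbb{E}_{(s,a)\sim d}\big[c_k(s,a)\big] \le \hat{c}_k
  \quad \text{for each cost index } k, \\
& \textstyle\sum_{a} d(s,a) = (1-\gamma)\,p_0(s)
  + \gamma \sum_{s',a'} P(s \mid s',a')\,d(s',a') \quad \forall s .
\end{align*}
% In the offline setting, d is reparameterized through correction ratios
% w(s,a) = d(s,a) / d^D(s,a), where d^D denotes the occupancy of the
% pre-collected dataset, so every expectation above can be estimated
% from dataset samples without further environment interaction:
\begin{equation*}
\mathbb{E}_{(s,a)\sim d}\big[f(s,a)\big]
  = \mathbb{E}_{(s,a)\sim d^D}\big[w(s,a)\,f(s,a)\big].
\end{equation*}
\end{document}

Estimating these correction ratios w, rather than a value function, is what DICE-style methods refer to as stationary distribution correction estimation; the abstract's "cost-conservative" behavior corresponds to tightening the cost bound so that estimation error is less likely to cause constraint violation in the true environment.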