Paper Title
Supported Policy Optimization for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Policy constraint methods for offline reinforcement learning (RL) typically use parameterization or regularization to constrain the policy to actions within the support set of the behavior policy. The elaborate designs of parameterization methods usually intrude into the policy networks, which may incur extra inference cost and prevents taking full advantage of well-established online methods. Regularization methods reduce the divergence between the learned policy and the behavior policy, which may mismatch the inherent density-based definition of the support set and thereby fail to avoid out-of-distribution actions effectively. This paper presents Supported Policy OpTimization (SPOT), which is derived directly from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of the behavior policy and introduces a simple but effective density-based regularization term that can be plugged non-intrusively into off-the-shelf off-policy RL algorithms. SPOT achieves state-of-the-art performance on standard benchmarks for offline RL. Benefiting from the pluggable design, offline pretrained models from SPOT can also be fine-tuned online seamlessly.
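The abstract describes the mechanism only at a high level. As a rough illustration, the sketch below shows how a VAE-based density estimate of the behavior policy could be added as a regularization term to a TD3-style actor update; it is a minimal sketch under assumed interfaces, and names such as ConditionalVAE, actor_loss, actor, critic, and lambda_reg are illustrative assumptions rather than the authors' reference implementation.

```python
# Hedged sketch: plugging a VAE-based density regularizer into a TD3-style
# actor update. All class/function names here are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionalVAE(nn.Module):
    """Conditional VAE p(a|s) used as a density estimator for the behavior policy."""

    def __init__(self, state_dim, action_dim, latent_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def elbo(self, state, action):
        """Evidence lower bound on log p(a|s), used as a proxy for log pi_beta(a|s)."""
        mu, log_std = self.enc(torch.cat([state, action], -1)).chunk(2, -1)
        std = log_std.exp()
        z = mu + std * torch.randn_like(std)                       # reparameterization trick
        recon = self.dec(torch.cat([state, z], -1))
        recon_ll = -((recon - action) ** 2).sum(-1)                 # Gaussian log-likelihood (up to a constant)
        kl = 0.5 * (mu ** 2 + std ** 2 - 2 * log_std - 1).sum(-1)   # KL(q(z|s,a) || N(0, I))
        return recon_ll - kl


def actor_loss(actor, critic, cvae, state, lambda_reg=0.1):
    """TD3-style actor objective with a density-based support regularizer:
    maximize Q(s, pi(s)) + lambda * log pi_beta_hat(pi(s) | s)."""
    action = actor(state)
    q = critic(state, action).reshape(-1)     # critic value, flattened to shape (batch,)
    log_density = cvae.elbo(state, action)    # lower bound on log pi_beta(a|s)
    return -(q + lambda_reg * log_density).mean()
```

The design point the abstract emphasizes is that the regularizer touches only the actor loss: the ELBO gives a tractable stand-in for the behavior density log pi_beta(a|s), so the constraint can be bolted onto an existing off-policy algorithm without modifying the policy network itself.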