Paper Title
POPO: Pessimistic Offline Policy Optimization
Paper Authors
Paper Abstract
Offline reinforcement learning (RL), also known as batch RL, aims to optimize a policy from a large pre-recorded dataset without interacting with the environment. This setting offers the promise of exploiting diverse, pre-collected datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy algorithms based on Q-learning or actor-critic methods perform poorly when learning from a static dataset. In this work, we study why off-policy RL methods fail to learn in the offline setting from the value-function perspective, and we propose a novel offline RL algorithm, which we call Pessimistic Offline Policy Optimization (POPO), that learns a pessimistic value function to obtain a strong policy. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, matching or outperforming several state-of-the-art offline RL algorithms on benchmark tasks.
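The abstract does not spell out how the pessimistic value function is constructed. As a rough illustration only, one common way to build a pessimistic value target in offline RL is to penalize the bootstrapped Q estimate by the disagreement of a Q-ensemble; the sketch below shows this idea. The network sizes, the penalty weight `beta`, and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: a lower-confidence-bound style pessimistic value target.
# This is NOT the POPO objective from the paper; `beta` and the ensemble setup
# are assumed hyperparameters for demonstration.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Simple MLP critic Q(s, a) -> scalar."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


def pessimistic_target(
    q_ensemble: list,              # list[QNetwork]
    reward: torch.Tensor,          # shape (batch, 1)
    next_state: torch.Tensor,      # shape (batch, state_dim)
    next_action: torch.Tensor,     # shape (batch, action_dim), e.g. sampled from the policy
    done: torch.Tensor,            # shape (batch, 1), 1.0 at terminal transitions
    gamma: float = 0.99,
    beta: float = 1.0,             # pessimism strength (assumed hyperparameter)
) -> torch.Tensor:
    """Bootstrapped target: r + gamma * (mean_Q - beta * std_Q) at the next state."""
    with torch.no_grad():
        q_next = torch.stack(
            [q(next_state, next_action) for q in q_ensemble], dim=0
        )  # (ensemble_size, batch, 1)
        # Penalize by ensemble disagreement so out-of-distribution actions,
        # where the critics disagree, receive lower value estimates.
        pessimistic_q = q_next.mean(dim=0) - beta * q_next.std(dim=0)
        return reward + gamma * (1.0 - done) * pessimistic_q
```

Regressing the critics toward such a penalized target discourages the learned policy from exploiting overestimated values on actions unsupported by the static dataset, which is the failure mode the abstract attributes to standard off-policy methods.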