Paper Title

Towards Data-Driven Offline Simulations for Online Reinforcement Learning

Paper Authors

Shengpu Tang, Felipe Vieira Frujeri, Dipendra Misra, Alex Lamb, John Langford, Paul Mineiro, Sebastian Kochman

Paper Abstract

Modern decision-making systems, from robots to web recommendation engines, are expected to adapt: to user preferences, changing circumstances or even new tasks. Yet, it is still uncommon to deploy a dynamically learning agent (rather than a fixed policy) to a production system, as it's perceived as unsafe. Using historical data to reason about learning algorithms, similar to offline policy evaluation (OPE) applied to fixed policies, could help practitioners evaluate and ultimately deploy such adaptive agents to production. In this work, we formalize offline learner simulation (OLS) for reinforcement learning (RL) and propose a novel evaluation protocol that measures both fidelity and efficiency of the simulation. For environments with complex high-dimensional observations, we propose a semi-parametric approach that leverages recent advances in latent state discovery in order to achieve accurate and efficient offline simulations. In preliminary experiments, we show the advantage of our approach compared to fully non-parametric baselines. The code to reproduce these experiments will be made available at https://github.com/microsoft/rl-offline-simulation.
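
To make "offline learner simulation" concrete, the sketch below illustrates the kind of fully non-parametric replay baseline the abstract compares against: the learning agent only advances when a logged transition matches the action it just chose, which keeps the simulation faithful to the data but is data-hungry. Everything here is an illustrative assumption of ours, not code from the paper or its repository: the OfflineReplaySimulator class, the (obs, action, reward, next_obs, done) tuple layout, and the toy logged dataset.

```python
# Minimal sketch (our illustration) of a fully non-parametric offline simulation
# baseline: the learner interacts with a "simulator" that can only replay logged
# transitions whose logged action matches the action the learner chose.
# Assumes discrete/hashable observations and logged tuples
# (obs, action, reward, next_obs, done); these names are ours, not the paper's.
import random
from collections import defaultdict


class OfflineReplaySimulator:
    def __init__(self, logged_transitions):
        # Index logged transitions by (observation, action) so the learner's
        # choice can be matched against what was actually logged.
        self._index = defaultdict(list)
        for obs, action, reward, next_obs, done in logged_transitions:
            self._index[(obs, action)].append((reward, next_obs, done))

    def step(self, obs, action):
        """Return a logged (reward, next_obs, done) matching (obs, action),
        or None if no matching transition remains (simulation must stop)."""
        candidates = self._index.get((obs, action))
        if not candidates:
            return None
        # Sample without replacement so each logged transition is used at most
        # once; this keeps the replay faithful but inefficient, which is the
        # cost a semi-parametric, latent-state-based simulator aims to reduce.
        i = random.randrange(len(candidates))
        return candidates.pop(i)


if __name__ == "__main__":
    # Toy logged data from a 2-state, 2-action chain, purely for illustration.
    logged = [
        ("s0", 0, 0.0, "s1", False),
        ("s0", 1, 0.0, "s0", False),
        ("s1", 0, 1.0, "s0", True),
        ("s1", 1, 0.0, "s1", False),
    ] * 25
    sim = OfflineReplaySimulator(logged)
    obs, total_reward, steps = "s0", 0.0, 0
    while True:
        action = random.choice([0, 1])  # stand-in for a learning agent's choice
        result = sim.step(obs, action)
        if result is None:
            break  # matching logged data exhausted
        reward, next_obs, done = result
        total_reward += reward
        steps += 1
        obs = "s0" if done else next_obs
    print(f"simulated {steps} steps, return {total_reward}")
```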
