Paper Title

Offline Policy Optimization with Eligible Actions

Paper Authors

Yao Liu, Yannis Flet-Berliac, Emma Brunskill

Paper Abstract

Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which it may be possible for the learned policy to essentially avoid making aligned decisions for part of the initial state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification of the proposed algorithm. We also show the limitations of previous attempts to this approach. We test our algorithm in a healthcare-inspired simulator, a logged dataset collected from real hospitals and continuous control tasks. These experiments show the proposed method yields less overfitting and better test performance compared to state-of-the-art batch reinforcement learning algorithms.
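For reference, the sketch below illustrates the standard importance-weighted return estimators that the abstract refers to (ordinary and self-normalized importance sampling for off-policy evaluation); it is not the paper's proposed per-state-neighborhood constrained algorithm. The function and argument names (`importance_sampled_return`, `policy_prob`, `behavior_prob`) are hypothetical, and the trajectory format is assumed for illustration only.

```python
import numpy as np

def importance_sampled_return(trajectories, policy_prob, behavior_prob, gamma=1.0):
    """Estimate the return of a target policy from logged trajectories collected
    under a behavior policy, using importance sampling.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples
    policy_prob(s, a): probability that the target policy takes action a in state s
    behavior_prob(s, a): probability that the behavior (logging) policy took a in s
    """
    weights, returns = [], []
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            w *= policy_prob(s, a) / behavior_prob(s, a)  # cumulative importance ratio
            ret += (gamma ** t) * r                       # discounted return
        weights.append(w)
        returns.append(ret)
    weights, returns = np.asarray(weights), np.asarray(returns)
    ois = float(np.mean(weights * returns))                    # ordinary IS (unbiased)
    wis = float(np.sum(weights * returns) / np.sum(weights))   # self-normalized (weighted) IS
    return ois, wis
```

Maximizing such importance-weighted estimates over a policy class is the setting in which the abstract's overfitting phenomenon arises: a learned policy can inflate the estimate by concentrating importance weight on a few favorable logged trajectories while effectively ignoring part of the initial state space.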
