Paper Title
Oracle Inequalities for Model Selection in Offline Reinforcement Learning
Paper Authors
Paper Abstract
In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation. The learner is given a nested sequence of model classes to minimize squared Bellman error and must select among these to achieve a balance between approximation and estimation error of the classes. We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors. The algorithm, ModBE, takes as input a collection of candidate model classes and a generic base offline RL algorithm. By successively eliminating model classes using a novel one-sided generalization test, ModBE returns a policy with regret scaling with the complexity of the minimally complete model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to solving a series of square loss regression problems and then comparing relative square loss between classes. We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
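A minimal sketch of the selection loop the abstract describes, assuming a nested sequence of candidate classes ordered from simplest to richest. The helper names (`fit_squared_bellman`, `threshold_fn`, `base_rl`) are hypothetical placeholders, not the paper's API: the idea is only to illustrate fitting each class by square-loss (Bellman error) regression, eliminating classes by comparing relative square loss against a one-sided tolerance, and then handing the selected class to a generic base offline RL algorithm. It is not the paper's ModBE pseudocode.

```python
def select_model_class(model_classes, offline_data, fit_squared_bellman, threshold_fn, base_rl):
    """Illustrative model-selection loop (assumed helpers, not the paper's ModBE).

    model_classes: nested sequence of candidate classes, simplest first.
    fit_squared_bellman(cls, data): fits cls by square-loss regression on the
        logged data and returns its held-out squared Bellman error estimate.
    threshold_fn(k, j): tolerance of the one-sided comparison between class k
        and a richer class j (stands in for the generalization test).
    base_rl(cls, data): generic base offline RL algorithm run with class cls.
    """
    # Fit every candidate class once; each fit is an ordinary square-loss regression.
    losses = [fit_squared_bellman(cls, offline_data) for cls in model_classes]

    # Scan from the simplest class upward. Keep the first class that no richer
    # class improves upon by more than the test's tolerance; otherwise eliminate
    # it and move on. Fall back to the richest class if all simpler ones fail.
    selected = len(model_classes) - 1
    for k, loss_k in enumerate(losses):
        richer = enumerate(losses[k + 1:], start=k + 1)
        if all(loss_k - loss_j <= threshold_fn(k, j) for j, loss_j in richer):
            selected = k
            break

    # Run the base offline RL algorithm with the selected model class.
    return base_rl(model_classes[selected], offline_data)
```

Under this reading, the per-step cost is a handful of regression fits plus pairwise loss comparisons, which matches the abstract's claim that the procedure amounts to solving square-loss regressions and comparing relative losses between classes.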