Paper Title
Value Driven Representation for Human-in-the-Loop Reinforcement Learning
Paper Authors
Paper Abstract
Interactive adaptive systems powered by Reinforcement Learning (RL) have many potential applications, such as intelligent tutoring systems. In such systems there is typically an external human system designer who is creating, monitoring, and modifying the interactive adaptive system, trying to improve its performance on the target outcomes. In this paper we focus on the algorithmic foundations of how to help the system designer choose the set of sensors or features that define the observation space used by the reinforcement learning agent. We present an algorithm, value driven representation (VDR), that can iteratively and adaptively augment the observation space of a reinforcement learning agent so that it is sufficient to capture a (near) optimal policy. To do so, we introduce a new method to optimistically estimate the value of a policy using offline simulated Monte Carlo rollouts. We evaluate the performance of our approach on standard RL benchmarks with simulated humans and demonstrate significant improvements over prior baselines.
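The key technical ingredient named in the abstract, optimistic value estimation from offline simulated Monte Carlo rollouts, might look roughly like the sketch below. This is a minimal illustration only, not the paper's actual VDR procedure: the `simulator` and `policy` interfaces, the `visit_counts` table, and the count-based optimism bonus are all assumptions made for the sake of the example.

```python
import numpy as np

def optimistic_mc_value(policy, simulator, visit_counts=None,
                        n_rollouts=100, horizon=50, gamma=0.99,
                        bonus_scale=1.0):
    """Estimate a policy's value via Monte Carlo rollouts in an offline
    simulator, with a count-based optimism bonus for poorly covered states.

    Hypothetical interfaces assumed here:
      policy(state) -> action
      simulator.reset() -> state
      simulator.step(state, action) -> (next_state, reward)
      visit_counts: dict mapping states to offline-data visit counts
    """
    returns = []
    for _ in range(n_rollouts):
        state = simulator.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = simulator.step(state, action)
            # Optimism: inflate the reward in states the offline data
            # has rarely (or never) visited.
            n = visit_counts.get(state, 0) if visit_counts else 0
            total += discount * (reward + bonus_scale / np.sqrt(n + 1))
            discount *= gamma
        returns.append(total)
    # Average discounted return across rollouts is the value estimate.
    return float(np.mean(returns))
```

The count-based bonus is one simple way to make the estimate optimistic: policies that reach under-covered parts of the state space receive inflated value, which encourages considering observation-space augmentations whose benefit the offline data cannot yet confirm.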