Paper Title

Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal Agent

Paper Authors

Michael K. Cohen, Elliot Catt, Marcus Hutter

Paper Abstract

Reinforcement learners are agents that learn to pick actions that lead to high reward. Ideally, the value of a reinforcement learner's policy approaches optimality--where the optimal informed policy is the one which maximizes reward. Unfortunately, we show that if an agent is guaranteed to be "asymptotically optimal" in any (stochastically computable) environment, then subject to an assumption about the true environment, this agent will be either "destroyed" or "incapacitated" with probability 1. Much work in reinforcement learning uses an ergodicity assumption to avoid this problem. Often, doing theoretical research under simplifying assumptions prepares us to provide practical solutions even in the absence of those assumptions, but the ergodicity assumption in reinforcement learning may have led us entirely astray in preparing safe and effective exploration strategies for agents in dangerous environments. Rather than assuming away the problem, we present an agent, Mentee, with the modest guarantee of approaching the performance of a mentor, doing safe exploration instead of reckless exploration. Critically, Mentee's exploration probability depends on the expected information gain from exploring. In a simple non-ergodic environment with a weak mentor, we find Mentee outperforms existing asymptotically optimal agents and its mentor.
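The key mechanism described above is that Mentee explores only with a probability tied to the expected information gain from exploring. The snippet below is a minimal illustrative sketch of that general idea, not the paper's actual rule: it maintains a Bayesian posterior over two hypothetical candidate environments and sets the exploration probability proportional to the expected KL divergence between the updated and current posterior. The two-environment setup and the scaling constant beta are assumptions made purely for illustration.

import numpy as np

# Hypothetical illustration (not the paper's exact rule): exploration probability
# proportional to the expected information gain about which candidate environment is true.

# Two candidate Bernoulli-reward environments for a single exploratory action:
# reward_probs[i] = P(reward = 1 | environment i, explore).
reward_probs = np.array([0.2, 0.9])
posterior = np.array([0.5, 0.5])  # current belief over the two environments

def expected_info_gain(posterior, reward_probs):
    """Expected KL divergence between the updated and current posterior,
    averaged over the predicted outcome of the exploratory action."""
    gain = 0.0
    for outcome_prob in (reward_probs, 1.0 - reward_probs):  # reward = 1, then reward = 0
        p_outcome = posterior @ outcome_prob
        if p_outcome == 0.0:
            continue
        updated = posterior * outcome_prob / p_outcome
        nz = updated > 0
        gain += p_outcome * np.sum(updated[nz] * np.log(updated[nz] / posterior[nz]))
    return gain

beta = 2.0  # assumed scaling constant; a real agent would also decay exploration over time
p_explore = min(1.0, beta * expected_info_gain(posterior, reward_probs))
print(f"expected info gain: {expected_info_gain(posterior, reward_probs):.3f}")
print(f"exploration probability: {p_explore:.3f}")

In this sketch, the more the predicted outcome would shift the agent's beliefs, the more willing it is to explore; once the posterior concentrates on one environment, the expected information gain, and with it the exploration probability, shrinks toward zero.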
