Paper Title

Lazy-MDPs: Towards Interpretable Reinforcement Learning by Learning When to Act

Paper Authors

Alexis Jacq, Johan Ferret, Olivier Pietquin, Matthieu Geist

Paper Abstract

Traditionally, Reinforcement Learning (RL) aims at deciding how to act optimally for an artificial agent. We argue that deciding when to act is equally important. As humans, we drift from default, instinctive or memorized behaviors to focused, thought-out behaviors when required by the situation. To enhance RL agents with this aptitude, we propose to augment the standard Markov Decision Process and make a new mode of action available: being lazy, which defers decision-making to a default policy. In addition, we penalize non-lazy actions in order to encourage minimal effort and have agents focus on critical decisions only. We name the resulting formalism lazy-MDPs. We study the theoretical properties of lazy-MDPs, expressing value functions and characterizing optimal solutions. Then we empirically demonstrate that policies learned in lazy-MDPs generally come with a form of interpretability: by construction, they show us the states where the agent takes control over the default policy. We deem those states and corresponding actions important since they explain the difference in performance between the default and the new, lazy policy. With suboptimal policies as default (pretrained or random), we observe that agents are able to get competitive performance in Atari games while only taking control in a limited subset of states.
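
The abstract describes the construction only at a high level. Below is a minimal sketch of how the lazy action and the penalty on non-lazy actions could be wrapped around a gym-style environment (reset()/step() interface with a discrete action space). The class name LazyMDPWrapper, the appended lazy-action index, and the penalty value are illustrative assumptions for this sketch, not definitions taken from the paper.

```python
class LazyMDPWrapper:
    """Sketch of a lazy-MDP: the action set is augmented with one extra
    "lazy" action that defers the decision to a default policy, and every
    non-lazy action incurs a fixed penalty subtracted from the reward.
    Assumes a gym-style env (reset() -> obs; step(a) -> obs, reward, done, info)
    with a discrete action space exposing `action_space.n`."""

    def __init__(self, env, default_policy, penalty=0.1):
        self.env = env
        self.default_policy = default_policy      # maps observation -> action
        self.penalty = penalty                    # cost of taking control (illustrative value)
        self.lazy_action = env.action_space.n     # index of the extra, appended action
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        if action == self.lazy_action:
            # Lazy: defer to the default policy, no penalty.
            env_action = self.default_policy(self._last_obs)
            cost = 0.0
        else:
            # Non-lazy: the agent takes control and pays the penalty.
            env_action = action
            cost = self.penalty
        obs, reward, done, info = self.env.step(env_action)
        self._last_obs = obs
        return obs, reward - cost, done, info
```

Under this construction, states in which the learned policy chooses an action other than the lazy one are exactly the states where it takes control over the default policy, which is the source of the interpretability claimed in the abstract.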
