Paper Title
Maximizing Information Gain in Partially Observable Environments via Prediction Reward
Paper Authors
Paper Abstract
Information gathering in a partially observable environment can be formulated as a reinforcement learning (RL) problem where the reward depends on the agent's uncertainty. For example, the reward can be the negative entropy of the agent's belief over an unknown (or hidden) variable. Typically, the rewards of an RL agent are defined as a function of the state-action pairs and not as a function of the belief of the agent; this hinders the direct application of deep RL methods to such tasks. This paper tackles the challenge of using belief-based rewards for a deep RL agent by offering a simple insight: maximizing any convex function of the belief of the agent can be approximated by instead maximizing a prediction reward, a reward based on prediction accuracy. In particular, we derive the exact error between negative entropy and the expected prediction reward. This insight provides theoretical motivation for several fields using prediction rewards---namely visual attention, question answering systems, and intrinsic motivation---and highlights their connection to the usually distinct fields of active perception, active sensing, and sensor placement. Based on this insight, we present deep anticipatory networks (DANs), which enable an agent to take actions to reduce its uncertainty without performing explicit belief inference. We present two applications of DANs: building a sensor selection system for tracking people in a shopping mall and learning discrete models of attention on fashion MNIST and MNIST digit classification.
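To make the abstract's central insight concrete, the following minimal sketch (an illustration, not the paper's implementation) compares the negative entropy of a belief with the expected prediction reward obtained by guessing the most likely hidden value and earning reward 1 when correct. Both quantities grow as the belief concentrates, which is why prediction reward can stand in for the belief-based objective:

```python
import numpy as np

def neg_entropy(belief):
    # Negative Shannon entropy (in nats) of a belief distribution:
    # sum_i b_i * log(b_i), with the convention 0 * log(0) = 0.
    b = np.asarray(belief, dtype=float)
    nz = b[b > 0]
    return float(np.sum(nz * np.log(nz)))

def expected_prediction_reward(belief):
    # If the agent predicts the most likely hidden value and receives
    # reward 1 when correct (0 otherwise), the expected reward under
    # its own belief is simply max_i b_i.
    return float(np.max(np.asarray(belief, dtype=float)))

# A flat belief (high uncertainty) vs. a peaked belief (low uncertainty).
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

# Both objectives prefer the peaked belief, so an agent maximizing
# expected prediction reward also tends to reduce its entropy.
print(neg_entropy(uniform), neg_entropy(peaked))
print(expected_prediction_reward(uniform), expected_prediction_reward(peaked))
```

The paper's contribution is the exact characterization of the gap between these two quantities; the sketch only shows that they move together.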