Paper Title


Active Inference and Reinforcement Learning: A unified inference on continuous state and action spaces under partial observability

Paper Authors

Parvin Malekzadeh, Konstantinos N. Plataniotis

Paper Abstract


Reinforcement learning (RL) has garnered significant attention for developing decision-making agents that aim to maximize rewards, specified by an external supervisor, within fully observable environments. However, many real-world problems involve partial observations, formulated as partially observable Markov decision processes (POMDPs). Previous studies have tackled RL in POMDPs by either incorporating the memory of past actions and observations or by inferring the true state of the environment from observed data. However, aggregating observed data over time becomes impractical in continuous spaces. Moreover, inference-based RL approaches often require many samples to perform well, as they focus solely on reward maximization and neglect uncertainty in the inferred state. Active inference (AIF) is a framework formulated in POMDPs and directs agents to select actions by minimizing a function called expected free energy (EFE). This supplies reward-maximizing (exploitative) behaviour, as in RL, with information-seeking (exploratory) behaviour. Despite this exploratory behaviour of AIF, its usage is limited to discrete spaces due to the computational challenges associated with EFE. In this paper, we propose a unified principle that establishes a theoretical connection between AIF and RL, enabling seamless integration of these two approaches and overcoming their aforementioned limitations in continuous space POMDP settings. We substantiate our findings with theoretical analysis, providing novel perspectives for utilizing AIF in the design of artificial agents. Experimental results demonstrate the superior learning capabilities of our method in solving continuous space partially observable tasks. Notably, our approach harnesses information-seeking exploration, enabling it to effectively solve reward-free problems and rendering explicit task reward design by an external supervisor optional.
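For context, the expected free energy (EFE) that AIF agents minimize is commonly written in the AIF literature as shown below. This is a standard textbook decomposition, not necessarily the exact continuous-space formulation used in this paper; the symbols G, q, p, o_tau, and s_tau are illustrative assumptions.

% Expected free energy of a policy \pi (standard AIF form; notation is illustrative)
\[
G(\pi) = \sum_{\tau} \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}
\big[ \ln q(s_\tau \mid \pi) - \ln p(o_\tau, s_\tau \mid \pi) \big]
\]
% which rearranges into an exploitative term and an exploratory term
\[
G(\pi) \;\approx\;
-\underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\big[ \ln p(o_\tau) \big]}_{\text{extrinsic (reward-maximizing) value}}
\;-\;
\underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\big[ D_{\mathrm{KL}}\big( q(s_\tau \mid o_\tau, \pi) \,\|\, q(s_\tau \mid \pi) \big) \big]}_{\text{epistemic (information-seeking) value}}
\]

The second term is the expected information gain about hidden states; it is what supplies the exploratory behaviour that the abstract contrasts with purely reward-maximizing RL.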
