Paper Title
POMDPs in Continuous Time and Discrete Spaces
Paper Authors
Abstract
Many processes, such as discrete event systems in engineering or population dynamics in biology, evolve in discrete space and continuous time. We consider the problem of optimal decision making in such discrete state and action space systems under partial observability. This places our work at the intersection of optimal filtering and optimal control. At the current state of research, a mathematical description of simultaneous decision making and filtering in continuous time with finite state and action spaces is still missing. In this paper, we give a mathematical description of a continuous-time partially observable Markov decision process (POMDP). By leveraging optimal filtering theory, we derive a Hamilton-Jacobi-Bellman (HJB) type equation that characterizes the optimal solution. Using techniques from deep learning, we approximately solve the resulting partial integro-differential equation. We present (i) an approach that solves the decision problem offline by learning an approximation of the value function and (ii) an online algorithm that provides a solution in belief space using deep reinforcement learning. We demonstrate applicability on a set of toy examples, paving the way for future methods that provide solutions to high-dimensional problems.
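To make the filtering ingredient of the abstract concrete, the following is a minimal, hypothetical sketch of belief-state tracking for a two-state continuous-time Markov chain observed through a noisy discrete channel. The generator matrix `Q`, the observation likelihoods `obs_lik`, and the observation sequence are all invented for illustration; the paper's actual models, dimensions, and filter derivation are not reproduced here. The prediction step approximates the master-equation flow with small explicit Euler steps rather than an exact matrix exponential.

```python
import numpy as np

# Assumed toy setup: latent two-state CTMC with rate matrix Q (rows sum to 0),
# observed at discrete times through a noisy channel obs_lik[state, obs].
Q = np.array([[-0.5,  0.5],
              [ 1.0, -1.0]])
obs_lik = np.array([[0.8, 0.2],
                    [0.3, 0.7]])

def predict(belief, dt, n_sub=200):
    """Propagate the belief through the master equation d/dt pi = pi Q,
    using small Euler substeps as a crude stand-in for pi @ expm(Q * dt)."""
    h = dt / n_sub
    for _ in range(n_sub):
        belief = belief + h * (belief @ Q)
    return belief

def correct(belief, obs):
    """Bayes update of the predicted belief with the observation likelihood."""
    post = belief * obs_lik[:, obs]
    return post / post.sum()

# Run the filter on an invented observation stream (dt, observation) pairs.
belief = np.array([0.5, 0.5])
for dt, obs in [(0.2, 0), (0.2, 1), (0.2, 1)]:
    belief = correct(predict(belief, dt), obs)
print(belief)  # posterior distribution over the two latent states
```

In the POMDP setting sketched by the abstract, this belief vector would then serve as the state on which a learned value function or policy operates.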