Paper title
ChronosPerseus: Randomized Point-based Value Iteration with Importance Sampling for POSMDPs
Paper authors
Paper abstract
In reinforcement learning, agents have been applied successfully to environments modeled as Markov decision processes (MDPs). However, in many problem domains, an agent may suffer from noisy observations or from random times until its next decision. While partially observable Markov decision processes (POMDPs) handle noisy observations, they do not handle the unknown time aspect. One could discretize time, but this leads to Bellman's curse of dimensionality. To incorporate continuous sojourn-time distributions into the agent's decision making, we propose using partially observable semi-Markov decision processes (POSMDPs). We extend the randomized point-based value iteration (PBVI) algorithm \textsc{Perseus} of \citet{Spaan2005a} from POMDPs to POSMDPs by incorporating continuous sojourn-time distributions and using importance sampling to reduce the solver complexity. We call this new PBVI algorithm with importance sampling for POSMDPs \textsc{ChronosPerseus}. This further allows complex POMDPs that require temporal state information to be compressed by moving that information into the state sojourn time of a POSMDP. The second insight is that a set of sampled times, each weighted by its likelihood, can be used in a single backup, which further reduces the algorithm complexity. The solver works on both episodic and non-episodic problems. We conclude the paper with two examples: an episodic bus problem and a non-episodic maintenance problem.
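The idea of reusing one set of sampled sojourn times, each weighted by its likelihood, can be illustrated with a small importance-sampling sketch. This is not the \textsc{ChronosPerseus} implementation. The exponential sojourn-time distributions, the continuous discount rate `beta`, and all function names are assumptions chosen for illustration only. The sketch estimates the expected discount factor E[exp(-beta * tau)] for several target distributions from a single shared sample.

```python
import math
import random


def exp_pdf(t, rate):
    """Density of an exponential distribution (assumed sojourn-time model)."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0


def sample_times(n, proposal_rate, seed=0):
    """Draw one shared set of sojourn times from the proposal distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(proposal_rate) for _ in range(n)]


def expected_discount(times, target_rate, proposal_rate, beta):
    """Self-normalized importance-sampling estimate of E[exp(-beta * tau)].

    Each shared sample is weighted by the likelihood ratio between the
    target sojourn-time density and the proposal density.
    """
    weights = [exp_pdf(t, target_rate) / exp_pdf(t, proposal_rate) for t in times]
    total = sum(weights)
    return sum(w * math.exp(-beta * t) for w, t in zip(weights, times)) / total


if __name__ == "__main__":
    proposal_rate = 0.5  # heavier tail than both targets keeps the weights bounded
    beta = 0.1           # hypothetical continuous-time discount rate
    times = sample_times(20000, proposal_rate)
    for target_rate in (1.0, 2.0):
        estimate = expected_discount(times, target_rate, proposal_rate, beta)
        exact = target_rate / (target_rate + beta)  # closed form for the exponential case
        print(target_rate, estimate, exact)
```

The proposal is deliberately given a heavier tail than either target so that the likelihood ratios stay bounded. A proposal with a lighter tail than the target can give the weights infinite variance and make the estimate unreliable.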