Paper title
Linear programming-based solution methods for constrained partially observable Markov decision processes
Paper authors
Paper abstract
Constrained partially observable Markov decision processes (CPOMDPs) have been used to model various real-world phenomena. However, they are notoriously difficult to solve to optimality, and only a few approximation methods exist for obtaining high-quality solutions. In this study, grid-based approximations are used in combination with linear programming (LP) models to generate approximate policies for CPOMDPs. A detailed numerical study is conducted on six CPOMDP problem instances, considering both their finite and infinite horizon formulations. The quality of the approximation algorithms for solving unconstrained POMDP problems is established through a comparative analysis with exact solution methods. Then, the performance of the LP-based CPOMDP solution approaches is evaluated for varying budget levels. Finally, the flexibility of the LP-based approaches is demonstrated by applying deterministic policy constraints, together with a detailed investigation into their impact on rewards and CPU run time. For most of the finite horizon problems, deterministic policy constraints are found to have little impact on expected reward, but they introduce a significant increase in CPU run time. For infinite horizon problems, the reverse is observed: deterministic policies tend to yield lower expected total rewards than their stochastic counterparts, but the impact of deterministic constraints on CPU run time is negligible in this case. Overall, these results demonstrate that LP models can effectively generate approximate policies for both finite and infinite horizon problems while providing the flexibility to incorporate various additional constraints into the underlying model.
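The LP-based approach described above rests on the standard occupancy-measure linear program for constrained MDPs, applied to the finite MDP obtained from a grid-based discretization of the belief space. As a rough illustration of that underlying LP only, and not of the paper's actual algorithm or problem instances, the sketch below solves a toy discounted constrained MDP with `scipy.optimize.linprog` and recovers a (possibly stochastic) policy from the optimal occupancy measure; all model data here are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy constrained MDP (invented data, standing in for a belief-grid MDP):
# 2 states, 2 actions, infinite horizon with discount factor gamma.
nS, nA, gamma = 2, 2, 0.95
# P[s, a, s']: transition probabilities
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 3.0],   # rewards r(s, a)
              [0.5, 2.0]])
C = np.array([[0.0, 1.0],   # costs c(s, a); action 1 consumes budget
              [0.0, 1.0]])
budget = 5.0                # cap on expected discounted cost
mu0 = np.array([1.0, 0.0])  # initial state distribution

# Decision variables: discounted occupancy measures x[s, a], flattened.
# Flow-conservation equalities, one per state s':
#   sum_a x[s', a] - gamma * sum_{s, a} P[s, a, s'] x[s, a] = mu0[s']
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = float(s == sp) - gamma * P[s, a, sp]
b_eq = mu0

# Budget inequality: sum_{s, a} c(s, a) x[s, a] <= budget
A_ub = C.reshape(1, -1)
b_ub = np.array([budget])

# linprog minimizes, so negate the rewards to maximize expected return.
res = linprog(-R.ravel(), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x = res.x.reshape(nS, nA)
# Recover the policy pi(a | s) = x[s, a] / sum_a' x[s, a'];
# budget-tight solutions may randomize, matching the stochastic
# policies discussed in the abstract.
pi = x / x.sum(axis=1, keepdims=True)
print("expected discounted reward:", -res.fun)
print("policy pi(a | s):\n", pi)
```

Deterministic policy constraints, as studied in the paper, would additionally force each row of `pi` onto a single action, which turns the LP into a mixed-integer program and explains the run-time differences reported above.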