Paper Title

Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning

Paper Author

Ding, Kewen

Paper Abstract

Multi-objective reinforcement learning (MORL) is a relatively new field which builds on conventional reinforcement learning (RL) to solve multi-objective problems. One common approach is to extend scalar Q-learning by using vector Q-values in combination with a utility function, which captures the user's preferences for action selection. This study follows on from prior work and focuses on the factors that influence the frequency with which value-based MORL Q-learning algorithms learn the optimal policy for an environment with stochastic state transitions, in scenarios where the goal is to maximise the Scalarised Expected Return (SER) - that is, to maximise the average outcome over multiple runs rather than the outcome within each individual episode. The interaction between stochastic environments and MORL Q-learning algorithms is analysed on a simple multi-objective Markov decision process (MOMDP), the Space Traders problem, together with several variants of it. The empirical evaluations show that a well-designed reward signal can improve the performance of the original baseline algorithm, but it is still not sufficient to address more general environments. A variant of MORL Q-learning incorporating global statistics is shown to outperform the baseline method on the original Space Traders problem, but remains below 100 percent effective at finding the desired SER-optimal policy at the end of training. Option learning, on the other hand, is guaranteed to converge to the desired SER-optimal policy, but it does not scale up to more complex real-life problems. The main contribution of this thesis is to identify the extent to which the issue of noisy Q-value estimates impacts the ability to learn optimal policies under the combination of stochastic environments, non-linear utility and a constant learning rate.
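The abstract describes the standard value-based MORL Q-learning setup: one Q-value per objective for every state-action pair, scalarised by a (possibly non-linear) utility function at action-selection time, trained with a constant learning rate in a stochastic environment. The sketch below is a minimal illustration of that general setup only; the toy environment, the threshold-style utility and all hyper-parameters are made-up assumptions, and it is not the thesis' algorithm nor the Space Traders MOMDP.

```python
# Minimal sketch of vector-valued (multi-objective) Q-learning with a
# non-linear utility applied at action selection. Everything here is
# illustrative; it is NOT the thesis' method or the Space Traders problem.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, N_OBJECTIVES = 4, 2, 2
ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.95, 0.1, 2000  # constant learning rate

# Vector Q-table: one value per objective for every state-action pair.
Q = np.zeros((N_STATES, N_ACTIONS, N_OBJECTIVES))

def utility(v):
    # Illustrative non-linear utility: only credit the second objective
    # when the first (a constraint-like objective) is non-negative.
    return v[1] if v[0] >= 0.0 else v[0]

def greedy_action(state):
    # Scalarise each action's vector Q-value with the utility, then argmax.
    return int(np.argmax([utility(Q[state, a]) for a in range(N_ACTIONS)]))

def step(state, action):
    # Toy stochastic MOMDP transition: usually the agent advances and earns
    # objective-2 reward, occasionally it fails and objective 1 is penalised.
    if rng.random() < 0.9:
        next_state = (state + 1) % N_STATES
        reward = np.array([0.0, 1.0 if action == 1 else 0.5])
    else:
        next_state = state
        reward = np.array([-1.0, 0.0])
    done = next_state == 0 and state != 0  # episode ends when we wrap around
    return next_state, reward, done

for _ in range(EPISODES):
    state, done = 0, False
    while not done:
        # epsilon-greedy over utility-scalarised vector Q-values
        if rng.random() > EPSILON:
            action = greedy_action(state)
        else:
            action = int(rng.integers(N_ACTIONS))
        next_state, reward, done = step(state, action)
        # vector TD update with a constant learning rate
        bootstrap = 0.0 if done else GAMMA * Q[next_state, greedy_action(next_state)]
        Q[state, action] += ALPHA * (reward + bootstrap - Q[state, action])
        state = next_state

print("greedy policy per state:", [greedy_action(s) for s in range(N_STATES)])
```

Because the utility is applied to the current (noisy) Q-value estimates at every action selection, the scalarised ranking of actions can fluctuate during learning; this is the sensitivity to noisy Q-value estimates under non-linear utility and a constant learning rate that the abstract highlights as the thesis' main focus.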
