Paper Title
Experimental Study on the Effect of Multi-step Deep Reinforcement Learning in POMDPs
Paper Authors
Paper Abstract
Deep Reinforcement Learning (DRL) has made tremendous advances in both simulated and real-world robot control tasks in recent years. This is particularly the case for tasks that can be carefully engineered with a full state representation, which can then be formulated as a Markov Decision Process (MDP). However, applying DRL strategies designed for MDPs to novel robot control tasks can be challenging, because the available observations may be only a partial representation of the state, resulting in a Partially Observable Markov Decision Process (POMDP). This paper considers three popular DRL algorithms developed for MDPs, namely Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC), and studies their performance in POMDP scenarios. While prior work has found that SAC and TD3 typically outperform PPO across a broad range of tasks that can be represented as MDPs, we show, using three representative POMDP environments, that this is not always the case. Empirical studies show that this is related to multi-step bootstrapping, in which multi-step immediate rewards, rather than the one-step immediate reward, are used to compute the target value estimate of an observation-action pair. We identify this by observing that incorporating multi-step bootstrapping into TD3 (MTD3) and SAC (MSAC) results in improved robustness in POMDP settings.
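To make the multi-step bootstrapping idea concrete, the sketch below illustrates the standard n-step bootstrapped target that the abstract refers to: the sum of n discounted immediate rewards plus a discounted critic estimate at the n-th next observation, with one-step bootstrapping as the special case n = 1. This is an illustrative reconstruction, not code from the paper; the function name n_step_target and its arguments (rewards, bootstrap_value, gamma) are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's implementation) of the n-step bootstrapped target
#   y = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * Q(o_{t+n}, a_{t+n})
# used to train the critic, versus the one-step target used in standard TD3/SAC.

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Compute the n-step bootstrapped target value.

    rewards:         the n immediate rewards r_t, ..., r_{t+n-1} along a trajectory segment
    bootstrap_value: target-critic estimate Q(o_{t+n}, a_{t+n}) at the n-th next observation
    gamma:           discount factor
    """
    target = bootstrap_value
    for r in reversed(rewards):      # fold the rewards in from the end of the segment
        target = r + gamma * target
    return target

# One-step bootstrapping (standard TD3/SAC) is the special case n = 1:
one_step = n_step_target([0.5], bootstrap_value=2.0)            # 0.5 + 0.99 * 2.0
# Multi-step bootstrapping (as in MTD3/MSAC) folds in n > 1 immediate rewards:
multi_step = n_step_target([0.5, 0.1, -0.2], bootstrap_value=2.0)
print(one_step, multi_step)
```

Under partial observability, folding several observed rewards into the target before bootstrapping reduces reliance on value estimates at aliased observations, which is consistent with the robustness improvements the abstract reports for MTD3 and MSAC.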