Paper Title
Episodic Self-Imitation Learning with Hindsight
Paper Authors
Paper Abstract
Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state-action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode during the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in continuous control environments with sparse rewards. Experiments show that episodic self-imitation learning outperforms baseline on-policy algorithms and achieves performance comparable to state-of-the-art off-policy algorithms on several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. With the capability of solving sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems with continuous action spaces, such as robot guidance and manipulation.
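To make the two core ideas in the abstract concrete, the sketch below illustrates, in minimal form, (1) relabeling an entire episode with hindsight (treating the final achieved state as if it had been the goal) and (2) a trajectory selection step that discards uninformative relabeled episodes. This is an illustrative sketch only, not the paper's implementation: the episode tuple layout, the `reward_fn` interface, and the scalar return threshold in `select_informative` are all assumptions standing in for the authors' actual selection rule and loss.

```python
import numpy as np


def hindsight_relabel(episode, reward_fn):
    """Relabel a whole episode with its final achieved state as the goal.

    `episode` is a list of (state, action, achieved_goal) tuples; this
    layout and `reward_fn` are illustrative assumptions, not the paper's
    exact interfaces.
    """
    new_goal = episode[-1][2]  # pretend the final achieved state was the goal
    relabeled = []
    for state, action, achieved in episode:
        reward = reward_fn(achieved, new_goal)
        relabeled.append((state, new_goal, action, reward))
    return relabeled


def select_informative(relabeled, min_return=0.5):
    """Trajectory selection: keep the episode only if its relabeled return
    clears a threshold, filtering out uninformative hindsight data.
    The fixed threshold is a stand-in for the paper's selection module."""
    episode_return = sum(r for (_, _, _, r) in relabeled)
    return relabeled if episode_return >= min_return else []


# Toy usage: a 1-D point agent with a sparse goal-reaching reward.
def sparse_reward(achieved, goal, tol=0.05):
    return 1.0 if abs(achieved - goal) < tol else 0.0


rng = np.random.default_rng(0)
episode, pos = [], 0.0
for _ in range(10):
    action = rng.uniform(-0.1, 0.1)
    pos += action
    episode.append((pos - action, action, pos))

demo = select_informative(hindsight_relabel(episode, sparse_reward))
print(f"kept {len(demo)} relabeled transitions for self-imitation")
```

The kept transitions would then serve as self-imitation targets for the policy update; because relabeling guarantees the final step succeeds under the new goal, the selection threshold is what separates genuinely instructive episodes from degenerate ones.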