Paper Title
Unified Policy Optimization for Continuous-action Reinforcement Learning in Non-stationary Tasks and Games
Paper Authors
Paper Abstract
This paper addresses policy learning in non-stationary environments and games with continuous actions. Inspired by the ideas of follow-the-regularized-leader (FTRL) and mirror descent (MD) updates, and departing from the classical reward-maximization mechanism, we propose PORL, a no-regret-style reinforcement learning algorithm for continuous-action tasks. We prove that PORL has a last-iterate convergence guarantee, which is important for adversarial and cooperative games. Empirical studies show that, in stationary environments such as MuJoCo locomotion control tasks, PORL performs as well as, if not better than, the soft actor-critic (SAC) algorithm; in non-stationary settings, including dynamically changing environments, adversarial training, and competitive games, PORL outperforms SAC with both better final policy performance and a more stable training process.
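To make the FTRL/mirror-descent idea in the abstract concrete, the following is a minimal sketch of a KL-regularized (proximal) policy update for a diagonal Gaussian policy over continuous actions: each step maximizes the critic's value while staying close, in KL divergence, to the previous policy iterate. This is an illustrative assumption, not the authors' PORL implementation; all names (`GaussianPolicy`, `mirror_descent_policy_step`, `eta`) are hypothetical.

```python
# Hypothetical mirror-descent-style policy update sketch (not the PORL code):
# maximize Q(s, a) for reparameterized actions while penalizing KL divergence
# to the previous policy iterate, in the spirit of FTRL / MD updates.
import copy
import torch
from torch import nn
from torch.distributions import Normal, kl_divergence


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy over continuous actions."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * act_dim))

    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return Normal(mean, log_std.clamp(-5, 2).exp())


def mirror_descent_policy_step(policy, old_policy, critic, obs, eta=0.1, lr=3e-4):
    """One proximal policy update: critic value minus eta * KL(new || old)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    dist, old_dist = policy(obs), old_policy(obs)
    actions = dist.rsample()                       # reparameterized sampling
    q_values = critic(torch.cat([obs, actions], dim=-1)).squeeze(-1)
    kl = kl_divergence(dist, old_dist).sum(-1)     # proximal term to old iterate
    loss = (-q_values + eta * kl).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Usage: after each update, the current policy becomes the next "old" iterate.
obs_dim, act_dim = 8, 2
policy = GaussianPolicy(obs_dim, act_dim)
old_policy = copy.deepcopy(policy)
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
obs_batch = torch.randn(32, obs_dim)
mirror_descent_policy_step(policy, old_policy, critic, obs_batch)
old_policy.load_state_dict(policy.state_dict())
```

The KL proximal term plays the role of the mirror-descent regularizer here, in contrast to SAC's entropy bonus; the exact regularizer and update schedule used by PORL are specified in the paper itself.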