Paper Title
Utilizing Prior Solutions for Reward Shaping and Composition in Entropy-Regularized Reinforcement Learning
Paper Authors
Paper Abstract
In reinforcement learning (RL), the ability to utilize prior knowledge from previously solved tasks can allow agents to quickly solve new problems. In some cases, these new problems may be approximately solved by composing the solutions of previously solved primitive tasks (task composition). Otherwise, prior knowledge can be used to adjust the reward function for a new problem, in a way that leaves the optimal policy unchanged but enables quicker learning (reward shaping). In this work, we develop a general framework for reward shaping and task composition in entropy-regularized RL. To do so, we derive an exact relation connecting the optimal soft value functions for two entropy-regularized RL problems with different reward functions and dynamics. We show how the derived relation leads to a general result for reward shaping in entropy-regularized RL. We then generalize this approach to derive an exact relation connecting optimal value functions for the composition of multiple tasks in entropy-regularized RL. We validate these theoretical contributions with experiments showing that reward shaping and task composition lead to faster learning in various settings.
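For reference, a minimal sketch (not taken from the paper) of the standard entropy-regularized setting that the abstract's "optimal soft value functions" refer to, written under common conventions: inverse temperature \beta, prior policy \pi_0, discount \gamma, and transition kernel p. The paper's exact shaping and composition relations build on soft value functions of this form; the specific derived relations are not reproduced here.

\begin{align*}
  % KL-regularized return: reward minus a per-step penalty for deviating from the prior policy
  V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}
      \Big( r(s_t,a_t) - \tfrac{1}{\beta}\,
        \log\frac{\pi(a_t \mid s_t)}{\pi_0(a_t \mid s_t)} \Big)
      \,\Big|\, s_0 = s \right], \\
  % Soft Bellman optimality equation satisfied by the optimal soft value function
  V^{*}(s) &= \frac{1}{\beta}\,
      \log \sum_{a} \pi_0(a \mid s)\,
      \exp\!\Big( \beta \big( r(s,a)
        + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[ V^{*}(s') \big] \big) \Big).
\end{align*}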