Paper Title

Shaping Rewards for Reinforcement Learning with Imperfect Demonstrations using Generative Models

Paper Authors

Yuchen Wu, Melissa Mozifian, Florian Shkurti

Paper Abstract

The potential benefits of model-free reinforcement learning to real robotics systems are limited by its uninformed exploration that leads to slow convergence, lack of data-efficiency, and unnecessary interactions with the environment. To address these drawbacks we propose a method that combines reinforcement and imitation learning by shaping the reward function with a state-and-action-dependent potential that is trained from demonstration data, using a generative model. We show that this accelerates policy learning by specifying high-value areas of the state and action space that are worth exploring first. Unlike the majority of existing methods that assume optimal demonstrations and incorporate the demonstration data as hard constraints on policy optimization, we instead incorporate demonstration data as advice in the form of a reward shaping potential trained as a generative model of states and actions. In particular, we examine both normalizing flows and Generative Adversarial Networks to represent these potentials. We show that, unlike many existing approaches that incorporate demonstrations as hard constraints, our approach is unbiased even in the case of suboptimal and noisy demonstrations. We present an extensive range of simulations, as well as experiments on the Franka Emika 7DOF arm, to demonstrate the practicality of our method.
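
To make the idea concrete, below is a minimal, self-contained sketch (not the authors' implementation) of potential-based reward shaping driven by a learned demonstration density. It assumes a look-ahead advice form for state-action potentials, F(s, a, s', a') = γΦ(s', a') − Φ(s, a), and a hypothetical `log_density_fn` standing in for, e.g., the log-probability of a normalizing flow fit to demonstration state-action pairs; all class and variable names are illustrative.

```python
import numpy as np


class DensityShapedReward:
    """Potential-based reward shaping from a demonstration density model.

    Phi(s, a) is taken to be the (scaled) log-density that a generative model
    trained on demonstration state-action pairs assigns to (s, a). The shaping
    term uses the look-ahead advice form F = gamma * Phi(s', a') - Phi(s, a),
    so the demonstrations steer exploration toward high-density regions rather
    than acting as hard constraints on the policy.
    """

    def __init__(self, log_density_fn, gamma=0.99, scale=1.0):
        # log_density_fn: callable (state, action) -> log p(state, action),
        # e.g. the log_prob of a normalizing flow fit to the demonstrations.
        self.log_density_fn = log_density_fn
        self.gamma = gamma
        self.scale = scale

    def potential(self, state, action):
        # High demonstration density => high potential => explore here first.
        return self.scale * float(self.log_density_fn(state, action))

    def shaped_reward(self, reward, state, action, next_state, next_action):
        # r'(s, a) = r(s, a) + gamma * Phi(s', a') - Phi(s, a)
        shaping = (self.gamma * self.potential(next_state, next_action)
                   - self.potential(state, action))
        return reward + shaping


if __name__ == "__main__":
    # Toy stand-in for a learned generative model: an isotropic Gaussian
    # centred on a single "demonstrated" state-action pair.
    demo_center = np.array([1.0, 1.0, 0.5])

    def toy_log_density(state, action):
        x = np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])
        return -0.5 * np.sum((x - demo_center) ** 2)

    shaper = DensityShapedReward(toy_log_density, gamma=0.99, scale=0.1)
    r = shaper.shaped_reward(
        reward=0.0,
        state=np.array([0.0, 0.0]), action=np.array([0.0]),
        next_state=np.array([0.5, 0.5]), next_action=np.array([0.3]),
    )
    print("shaped reward:", r)  # positive: the transition moves toward the demos
```

In an actual training pipeline, the shaping term would be applied to transitions inside the RL algorithm's update loop, with the weight `scale` and the choice of generative model (normalizing flow or GAN-derived potential, as in the paper) controlling how strongly the demonstrations bias exploration.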
