Paper Title
Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning
Paper Authors
Paper Abstract
Reinforcement learning provides a general framework for learning robotic skills while minimizing engineering effort. However, most reinforcement learning algorithms assume that a well-designed reward function is provided, and learn a single behavior for that single reward function. Such reward functions can be difficult to design in practice. Can we instead develop efficient reinforcement learning methods that acquire diverse skills without any reward function, and then repurpose these skills for downstream tasks? In this paper, we demonstrate that a recently proposed unsupervised skill discovery algorithm can be extended into an efficient off-policy method, making it suitable for performing unsupervised reinforcement learning in the real world. First, we show that our proposed algorithm provides a substantial improvement in learning efficiency, making reward-free real-world training feasible. Second, we move beyond simulation environments and evaluate the algorithm on real physical hardware. On quadrupeds, we observe that locomotion skills with diverse gaits and different orientations emerge without any rewards or demonstrations. We also demonstrate that the learned skills can be composed using model predictive control for goal-oriented navigation, without any additional training.
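A common formulation of the kind of reward-free skill discovery described in the abstract gives a skill-conditioned policy an intrinsic reward for producing transitions that are predictable under the commanded skill but distinct from transitions produced by other skills. The sketch below is a minimal, hypothetical illustration of such an intrinsic reward in Python; it assumes a learned skill-conditioned dynamics log-density `skill_dynamics_logprob(s, s_next, z)` and a uniform skill prior, and is not the authors' implementation.

```python
import numpy as np

def intrinsic_reward(skill_dynamics_logprob, s, s_next, z, skill_dim,
                     num_prior_samples=16, rng=None):
    """Approximate r(s, z, s') = log q(s'|s, z) - log E_{z'~p(z)} q(s'|s, z').

    The expectation over the skill prior p(z) is estimated with Monte Carlo
    samples; this is the variational-style reward that encourages skills to
    be both predictable and mutually distinguishable.
    """
    rng = rng or np.random.default_rng()
    # Log-likelihood of the observed transition under the commanded skill.
    log_q = skill_dynamics_logprob(s, s_next, z)
    # Log-likelihoods under skills drawn from the prior (uniform on [-1, 1]^d here).
    prior_skills = rng.uniform(-1.0, 1.0, size=(num_prior_samples, skill_dim))
    log_q_prior = np.array([skill_dynamics_logprob(s, s_next, zp) for zp in prior_skills])
    # log-mean-exp over prior samples approximates log p(s'|s).
    log_marginal = np.logaddexp.reduce(log_q_prior) - np.log(num_prior_samples)
    return log_q - log_marginal

if __name__ == "__main__":
    # Toy stand-in for a learned skill dynamics model: predicts s' = s + z.
    def toy_logprob(s, s_next, z):
        diff = s_next - (s + z)
        return -0.5 * np.sum(diff ** 2)  # unnormalized Gaussian log-density

    s = np.zeros(2)
    z = np.array([0.5, -0.3])
    s_next = s + z  # transition consistent with the commanded skill
    print(intrinsic_reward(toy_logprob, s, s_next, z, skill_dim=2))
```

Because the same learned skill dynamics can predict where each skill will take the robot, a model predictive controller can, as the abstract notes, search over skills to reach a navigation goal without any further training; the reward sketch above is the piece that makes those skills diverse and predictable in the first place.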