Paper Title


Learning to Run with Potential-Based Reward Shaping and Demonstrations from Video Data

Paper Authors

Aleksandra Malysheva, Daniel Kudenko, Aleksei Shpilman

Paper Abstract


Learning to produce efficient movement behaviour for humanoid robots from scratch is a hard problem, as has been illustrated by the "Learning to run" competition at NIPS 2017. The goal of this competition was to train a two-legged model of a humanoid body to run in a simulated race course with maximum speed. All submissions took a tabula rasa approach to reinforcement learning (RL) and were able to produce relatively fast, but not optimal running behaviour. In this paper, we demonstrate how data from videos of human running (e.g. taken from YouTube) can be used to shape the reward of the humanoid learning agent to speed up the learning and produce a better result. Specifically, we are using the positions of key body parts at regular time intervals to define a potential function for potential-based reward shaping (PBRS). Since PBRS does not change the optimal policy, this approach allows the RL agent to overcome sub-optimalities in the human movements that are shown in the videos. We present experiments in which we combine selected techniques from the top ten approaches from the NIPS competition with further optimizations to create a high-performing agent as a baseline. We then demonstrate how video-based reward shaping improves the performance further, resulting in an RL agent that runs twice as fast as the baseline in 12 hours of training. We furthermore show that our approach can overcome sub-optimal running behaviour in videos, with the learned policy significantly outperforming that of the running agent from the video.
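For context, potential-based reward shaping adds a shaping term of the standard form F(s_t, s_{t+1}) = γΦ(s_{t+1}) − Φ(s_t) to the environment reward, which is known not to change the optimal policy. The exact potential function used in the paper is not given in this abstract; a minimal sketch consistent with its description, assuming the potential is the negative distance between the agent's key body-part positions and the video-derived reference positions at the matching time step, is:

F(s_t, s_{t+1}) = \gamma\,\Phi(s_{t+1}) - \Phi(s_t), \qquad
\Phi(s_t) = -\sum_{k \in K} \bigl\lVert p_k(s_t) - \hat{p}_k(t) \bigr\rVert_2

where K is the (assumed) set of key body parts, p_k(s_t) is the agent's position of part k in state s_t, and \hat{p}_k(t) is the corresponding reference position extracted from the video at regular time intervals.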
