Paper Title
Learning Reward Functions for Robotic Manipulation by Observing Humans
Paper Authors
Paper Abstract
Observing a human demonstrator manipulate objects provides a rich, scalable and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least a difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. Thanks to the diversity of this training data, the learned reward function generalizes well enough to image observations from a previously unseen robot embodiment and environment to provide a meaningful prior for directed exploration in reinforcement learning. We propose two methods for scoring states relative to a goal image: through direct temporal regression, and through distances in an embedding space obtained with time-contrastive learning. By conditioning the function on a goal image, we are able to reuse one model across a variety of tasks. Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it is able to accelerate training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
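To make the two scoring approaches concrete, below is a minimal sketch of a HOLD-style goal-conditioned reward in PyTorch. The encoder architecture, the helper names `time_contrastive_loss` and `hold_reward`, and all hyperparameters are illustrative assumptions, not the paper's exact implementation: an encoder is trained on human-video frames with a time-contrastive (triplet) objective, and at policy-training time the reward is the negative distance between the current image observation and a goal image in the learned embedding space.

```python
# Minimal sketch of a HOLD-style goal-conditioned reward in PyTorch.
# Assumptions (not from the paper): the encoder architecture, the helper
# names `time_contrastive_loss` / `hold_reward`, and all hyperparameters.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Convolutional encoder mapping RGB frames to an embedding vector."""

    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def time_contrastive_loss(encoder: Encoder,
                          anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negative: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Triplet objective on human-video frames: frames close in time
    (anchor/positive) must embed closer together than temporally distant
    frames (anchor/negative) by at least `margin`."""
    z_a, z_p, z_n = encoder(anchor), encoder(positive), encoder(negative)
    d_pos = torch.linalg.vector_norm(z_a - z_p, dim=-1)
    d_neg = torch.linalg.vector_norm(z_a - z_n, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()


def hold_reward(encoder: Encoder,
                obs: torch.Tensor,
                goal: torch.Tensor) -> torch.Tensor:
    """Score a batch of image observations (B, 3, H, W) against goal
    images of the same shape: the reward is the negative distance in
    the learned embedding space, so states that look closer to the
    goal score higher."""
    with torch.no_grad():
        z_obs, z_goal = encoder(obs), encoder(goal)
    return -torch.linalg.vector_norm(z_obs - z_goal, dim=-1)
```

During policy training, this learned term would typically be combined with the sparse task-completion reward, e.g. r = r_sparse + λ · hold_reward(...), where λ is again a hypothetical weighting: the dense distance signal directs exploration while the sparse reward still defines success. The paper's alternative scoring method would swap the embedding distance for a network trained by direct temporal regression, predicting how many timesteps separate a frame from the goal.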