Paper Title
Offline Learning from Demonstrations and Unlabeled Experience
Paper Authors
Paper Abstract
Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
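The three-stage recipe lends itself to a short sketch. Below is a minimal, illustrative PyTorch version of the first two stages, assuming single-observation inputs: a classifier is trained to contrast demonstrator observations (label 1) against unlabeled ones (label 0), and its sigmoid output is then used to annotate all data with a reward. `OBS_DIM`, the network size, batch sizes, and the random stand-in tensors are hypothetical placeholders rather than the paper's actual setup, and the published reward learner may differ in detail (for instance, in how it handles good behavior hiding inside the unlabeled set).

```python
import torch
import torch.nn as nn

# Hypothetical observation dimensionality; a placeholder, not the paper's value.
OBS_DIM = 32

# Reward model: a binary classifier over observations. Its sigmoid output
# serves as the learned reward in [0, 1].
reward_model = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, 1),  # raw logit; sigmoid applied when annotating rewards
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

# Stand-in data: random tensors in place of real demonstrator and
# unlabeled observation sets.
demo_obs = torch.randn(1024, OBS_DIM)
unlabeled_obs = torch.randn(4096, OBS_DIM)

# Stage 1: learn a reward by contrasting demonstrator observations
# (label 1) with unlabeled observations (label 0).
for step in range(1000):
    d = demo_obs[torch.randint(len(demo_obs), (128,))]
    u = unlabeled_obs[torch.randint(len(unlabeled_obs), (128,))]
    logits = reward_model(torch.cat([d, u])).squeeze(-1)
    labels = torch.cat([torch.ones(128), torch.zeros(128)])
    loss = bce(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2: annotate every observation (demonstrations and unlabeled
# experience alike) with the learned reward.
with torch.no_grad():
    all_obs = torch.cat([demo_obs, unlabeled_obs])
    rewards = torch.sigmoid(reward_model(all_obs)).squeeze(-1)
```

The reward-annotated dataset would then be handed to an offline reinforcement learning algorithm, which is the third stage of ORIL.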