Title
Data augmentation for efficient learning from parametric experts
Authors
Abstract
We present a simple, yet powerful data-augmentation technique to enable data-efficient learning from parametric experts for reinforcement and imitation learning. We focus on what we call the policy cloning setting, in which we use online or offline queries of an expert or expert policy to inform the behavior of a student policy. This setting arises naturally in a number of problems, for instance as variants of behavior cloning, or as a component of other algorithms such as DAgger, policy distillation, or KL-regularized RL. Our approach, augmented policy cloning (APC), uses synthetic states to induce feedback-sensitivity in a region around sampled trajectories, thus dramatically reducing the environment interactions required for successful cloning of the expert. We achieve highly data-efficient transfer of behavior from an expert to a student policy for high-degrees-of-freedom control problems. We demonstrate the benefit of our method in the context of several existing and widely used algorithms that include policy cloning as a constituent part. Moreover, we highlight the benefits of our approach in two practically relevant settings: (a) expert compression, i.e., transfer to a student with fewer parameters; and (b) transfer from privileged experts, i.e., where the expert has a different observation space than the student, usually including access to privileged information.
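The core idea of querying the expert at synthetic states near sampled trajectory states can be illustrated with a minimal sketch. This is not the authors' implementation; the Gaussian perturbation scheme, the `noise_scale` and `num_aug` parameters, and the helper names are illustrative assumptions, with the expert modeled as a generic callable mapping a state to an action.

```python
import numpy as np

def augment_states(states, num_aug=4, noise_scale=0.1, rng=None):
    """Create synthetic states by perturbing each sampled trajectory
    state with Gaussian noise (hypothetical augmentation scheme;
    the paper's exact perturbation may differ)."""
    rng = np.random.default_rng() if rng is None else rng
    repeated = np.repeat(states, num_aug, axis=0)
    return repeated + rng.normal(scale=noise_scale, size=repeated.shape)

def build_cloning_dataset(states, expert_policy, num_aug=4,
                          noise_scale=0.1, rng=None):
    """Label both the original and the synthetic states with the
    expert's actions, yielding denser supervision in a region around
    the sampled trajectory for the student's cloning loss."""
    synthetic = augment_states(states, num_aug, noise_scale, rng)
    all_states = np.concatenate([states, synthetic], axis=0)
    actions = np.stack([expert_policy(s) for s in all_states])
    return all_states, actions
```

A student policy would then be fit by regressing these expert actions (e.g., with a squared-error or KL cloning loss) on the enlarged state set, rather than only on the states actually visited in the environment.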