Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization

Paper Title

Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization

Authors

Minghuan Liu, Zhengbang Zhu, Yuzheng Zhuang, Weinan Zhang, Jianye Hao, Yong Yu, Jun Wang

Abstract

Recent progress in state-only imitation learning extends the scope of applicability of imitation learning to real-world settings by relieving the need for observing expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans to the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy as a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO on learning a generalized target state planner while achieving the best imitation performance. We demonstrate the appealing usage of DePO for transferring across different tasks by pre-training, and the potential for co-training agents with various skills.
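The decoupling described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation and involves no learning: the 1-D environment, the `gain` parameter, and every function and class name below are hypothetical, chosen only to show how a policy can be composed from a state planner (which proposes the next target state) and an inverse dynamics model (which picks the action that reaches it).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical toy environment: a 1-D point whose next state is s + gain * a.
# None of these names come from the paper; they only illustrate the decoupling.


def make_dynamics(gain: float) -> Callable[[float, float], float]:
    """Forward dynamics of the toy environment: s' = s + gain * a."""
    return lambda s, a: s + gain * a


def make_inverse_dynamics(gain: float) -> Callable[[float, float], float]:
    """Inverse dynamics for the same environment: the action that moves s to s_next."""
    return lambda s, s_next: (s_next - s) / gain


def state_planner(s: float, goal: float = 4.0, max_step: float = 1.0) -> float:
    """High-level planner: proposes the next target state.

    It reasons only about states, so it does not depend on the action space.
    """
    delta = max(-max_step, min(max_step, goal - s))
    return s + delta


@dataclass
class DecoupledPolicy:
    """Policy composed of a state planner and an inverse dynamics model."""
    planner: Callable[[float], float]
    inverse_dynamics: Callable[[float, float], float]

    def act(self, s: float) -> float:
        target = self.planner(s)                 # where to go next
        return self.inverse_dynamics(s, target)  # how to get there


def rollout(policy: DecoupledPolicy,
            dynamics: Callable[[float, float], float],
            s0: float = 0.0,
            steps: int = 5) -> List[float]:
    """Run the policy in the environment and return the visited states."""
    states = [s0]
    for _ in range(steps):
        s = states[-1]
        states.append(dynamics(s, policy.act(s)))
    return states


# Same planner, two environments with different action scaling.
# Only the inverse dynamics model is swapped between them.
traj_a = rollout(DecoupledPolicy(state_planner, make_inverse_dynamics(1.0)),
                 make_dynamics(1.0))
traj_b = rollout(DecoupledPolicy(state_planner, make_inverse_dynamics(0.5)),
                 make_dynamics(0.5))
```

In this sketch both rollouts visit the same states even though the actions differ, because the planner is reused unchanged and only the inverse dynamics model is replaced. That is a hand-written analogue of the transfer across action spaces and dynamics that the abstract attributes to DePO, where both components are learned.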
