Paper Title
Understanding Adversarial Imitation Learning in Small Sample Regime: A Stage-coupled Analysis
Paper Authors
Paper Abstract
Imitation learning learns a policy from expert trajectories. While expert data is believed to be crucial for imitation quality, it has been found that one class of imitation learning methods, adversarial imitation learning (AIL), can achieve exceptional performance. With as few as one expert trajectory, AIL can match the expert's performance even over a long horizon, on tasks such as locomotion control. There are two mysterious points in this phenomenon. First, why can AIL perform well with only a few expert trajectories? Second, why does AIL maintain good performance regardless of the length of the planning horizon? In this paper, we theoretically explore these two questions. For a total-variation-distance-based AIL (called TV-AIL), our analysis shows a horizon-free imitation gap $\mathcal{O}(\min\{1, \sqrt{|\mathcal{S}|/N}\})$ on a class of instances abstracted from locomotion control tasks. Here $|\mathcal{S}|$ is the state space size of a tabular Markov decision process, and $N$ is the number of expert trajectories. We emphasize two important features of this bound. First, it is meaningful in both the small and large sample regimes. Second, it implies that the imitation gap of TV-AIL is at most 1 regardless of the planning horizon. Therefore, this bound explains the empirical observations. Technically, we leverage the multi-stage policy optimization structure of TV-AIL and present a new stage-coupled analysis via dynamic programming.
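
To make the bound concrete, below is a minimal Python sketch (not from the paper) that evaluates the stated imitation gap bound min{1, sqrt(|S|/N)} for a few hypothetical values of the state space size and the number of expert trajectories, with the hidden constant taken as 1 for illustration. It shows the two features the abstract highlights: the bound is capped at 1 in the small-sample regime, and the planning horizon never enters the expression.

    import math

    def tv_ail_gap_bound(num_states: int, num_trajectories: int) -> float:
        """Imitation gap bound O(min{1, sqrt(|S|/N)}) from the abstract,
        with the constant factor assumed to be 1 for illustration."""
        return min(1.0, math.sqrt(num_states / num_trajectories))

    # Hypothetical values: |S| = 20 states, varying numbers of expert
    # trajectories N. The bound stays at 1 when data is scarce and decays
    # like 1/sqrt(N) as data grows; the horizon H appears nowhere.
    for n in [1, 10, 100, 1000]:
        print(f"|S|=20, N={n:4d} -> gap bound <= {tv_ail_gap_bound(20, n):.3f}")

For example, with |S| = 20 and N = 1 the bound is min{1, sqrt(20)} = 1, while N = 100 gives sqrt(0.2), roughly 0.45; in both cases the value is independent of the horizon.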