CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Paper Title

CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Authors

Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, Ion Stoica

Abstract

Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the generated tasks. This is a non-stationary process in which the task distribution evolves along with the agent policies, creating instability over time. While past works demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge that bottlenecks them. To this end, we introduce CLUTR: a novel unsupervised curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by maximizing a minimax regret-based objective on a set of latent tasks sampled from this manifold. Using the fixed, pretrained task manifold, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show that CLUTR outperforms PAIRED, a principled and popular UED method, in the challenging CarRacing and navigation environments, achieving 10.6X and 45% improvement in zero-shot generalization, respectively. CLUTR also performs comparably to the non-UED state-of-the-art for CarRacing, while requiring 500X fewer environment interactions.
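The abstract does not spell out how the regret signal is computed. As a rough illustration only, the sketch below assumes a PAIRED-style estimate, in which the regret on a task is the best return of an antagonist agent minus the mean return of the protagonist agent, and shows how a teacher could rank a handful of latent tasks by that estimate. The function names, the task identifiers "z0" and "z1", and the return values are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def estimate_regret(antagonist_returns, protagonist_returns):
    """Assumed PAIRED-style regret estimate for a single task:
    best antagonist episode return minus mean protagonist episode return."""
    return max(antagonist_returns) - mean(protagonist_returns)


def pick_highest_regret_task(returns_by_task):
    """Return the id of the task with the largest estimated regret.

    returns_by_task maps a (hypothetical) latent task id to a pair
    (antagonist_returns, protagonist_returns).
    """
    return max(
        returns_by_task,
        key=lambda task_id: estimate_regret(*returns_by_task[task_id]),
    )


# Made-up episode returns for two latent tasks.
returns_by_task = {
    "z0": ([0.9, 0.8], [0.7, 0.9]),
    "z1": ([0.6, 1.0], [0.2, 0.4]),
}

hardest = pick_highest_regret_task(returns_by_task)
```

In this toy setup the teacher would favor the task where the antagonist succeeds but the protagonist still struggles. An actual teacher would be trained with RL to propose latent tasks rather than choosing from a fixed dictionary.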
