Paper Title
Generalized Hidden Parameter MDPs: Transferable Model-based RL in a Handful of Trials
Paper Authors
Paper Abstract
There is broad interest in creating RL agents that can solve many (related) tasks and adapt to new tasks and environments after initial training. Model-based RL leverages learned surrogate models that describe the dynamics and rewards of individual tasks, such that planning in a good surrogate can lead to good control of the true system. Rather than solving each task individually from scratch, hierarchical models can exploit the fact that tasks are often related by (unobserved) causal factors of variation in order to generalize efficiently; for example, learning how the mass of an item affects the force required to lift it can generalize to previously unobserved masses. We propose Generalized Hidden Parameter MDPs (GHP-MDPs), which describe a family of MDPs where both dynamics and reward can change as a function of hidden parameters that vary across tasks. The GHP-MDP augments model-based RL with latent variables that capture these hidden parameters, facilitating transfer across tasks. We also explore a variant of the model that incorporates explicit latent structure mirroring the causal factors of variation across tasks (for instance: agent properties, environmental factors, and goals). We experimentally demonstrate state-of-the-art performance and sample efficiency on a new, challenging MuJoCo task using reward and dynamics latent spaces, while beating a previous state-of-the-art baseline with $>10\times$ less data. Using test-time inference of the latent variables, our approach generalizes in a single episode to novel combinations of dynamics and reward, and to novel rewards.
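To make the core idea concrete, below is a minimal sketch of a latent-conditioned surrogate model with test-time latent inference. This is not the paper's implementation (which uses probabilistic ensembles and Bayesian inference over the latents); it is a simplified point-estimate version, and all names here (LatentConditionedModel, infer_latent, LATENT_DIM) are hypothetical.

```python
# Sketch (assumed, not the authors' code): dynamics and reward models are
# conditioned on per-task latent variables z; at test time only z is fit to
# a handful of observed transitions while shared weights stay frozen.
import torch
import torch.nn as nn

LATENT_DIM = 2  # assumed per-factor latent dimensionality

class LatentConditionedModel(nn.Module):
    """Predicts next-state deltas (or rewards) from (state, action, latent)."""
    def __init__(self, state_dim, action_dim, out_dim, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, state, action, z):
        return self.net(torch.cat([state, action, z], dim=-1))

def infer_latent(model, states, actions, targets, steps=200, lr=1e-2):
    """Test-time inference: fit a per-task latent z to observed transitions
    by gradient descent (a point estimate; the paper instead infers a
    posterior over z)."""
    z = torch.zeros(1, LATENT_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        pred = model(states, actions, z.expand(states.shape[0], -1))
        loss = ((pred - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```

Instantiating two such models, one for dynamics and one for reward, each with its own latent space, mirrors the structured variant described in the abstract: after a few transitions from a new task are used to infer the latents, a planner (e.g., MPC) can act against the adapted surrogates within a single episode.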