Paper Title

Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories

Authors

Qinqing Zheng, Mikael Henaff, Brandon Amos, Aditya Grover

Abstract

Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action and reward triplets at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful -- on several D4RL benchmarks~\cite{fu2020d4rl}, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10\% of trajectories which are highly suboptimal. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets, with algorithmic design choices (e.g., choice of inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
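To make the pipeline concrete, below is a minimal sketch of the proxy-labelling step, assuming continuous actions and an MSE regression loss; it is written in PyTorch and is not the authors' released code, and the names `InverseDynamics`, `train_idm`, and `proxy_label` are illustrative. The idea is exactly as the abstract describes: fit an inverse dynamics model on the labelled transitions, then use it to impute actions for the action-free trajectories.

```python
# Minimal sketch (not the paper's code) of the proxy-labelling pipeline:
# fit f(s_t, s_{t+1}) -> a_t on labelled data, then impute actions for
# action-free trajectories. Helper names here are illustrative assumptions.
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """MLP that predicts the action taken between consecutive states."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, s_next], dim=-1))

def train_idm(model, labelled_loader, epochs: int = 10, lr: float = 3e-4):
    """Supervised regression on (s_t, s_{t+1}) -> a_t from labelled data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, s_next, a in labelled_loader:
            loss = nn.functional.mse_loss(model(s, s_next), a)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def proxy_label(model, states: torch.Tensor) -> torch.Tensor:
    """Impute actions for an action-free trajectory of shape (T, state_dim)."""
    return model(states[:-1], states[1:])  # shape (T-1, action_dim)
```

The union of the true and proxy-labelled transitions can then be handed to any off-the-shelf offline RL algorithm, which is the "meta-algorithmic" aspect the abstract emphasizes.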
