Paper Title

How to Leverage Unlabeled Data in Offline Reinforcement Learning

Authors

Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine

Abstract


Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive. How can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels. Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.
