Paper Title


Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

Paper Authors

Desik Rengarajan, Gargi Vaidya, Akshay Sarvesh, Dileep Kalathil, Srinivas Shakkottai

Paper Abstract


A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine-grained feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from, rather than imitating, the offline data, LOGO orients its policy in the manner of the sub-optimal policy, while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm and establish a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
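The abstract's central mechanism, alternating a policy improvement step with a policy guidance step whose influence fades as learning progresses, can be illustrated with a small self-contained sketch. This is not the paper's actual LOGO implementation: the toy sparse-reward chain environment, the REINFORCE-style improvement step, the behavior-cloned estimate of the demonstrator, and the 1/(1 + 0.05·t) decay schedule below are all illustrative assumptions made for the example.

```python
"""
Minimal sketch (under assumptions noted above) of the idea in the abstract:
alternate a standard policy-improvement step on the sparse reward with a
guidance step that pulls the policy toward a behavior policy cloned from
offline demonstrations, with the guidance weight decayed over iterations so
the learner can eventually surpass the sub-optimal demonstrator.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sparse-reward chain: states 0..N-1, actions {0: left, 1: right},
# reward 1 only when the agent reaches state N-1.
N_STATES, HORIZON = 10, 30

def rollout(policy):
    s, logps, rewards = 0, [], []
    for _ in range(HORIZON):
        logits = policy(F.one_hot(torch.tensor(s), N_STATES).float())
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        logps.append(dist.log_prob(a))
        s = min(max(s + (1 if a.item() == 1 else -1), 0), N_STATES - 1)
        rewards.append(1.0 if s == N_STATES - 1 else 0.0)
        if s == N_STATES - 1:
            break
    return torch.stack(logps), torch.tensor(rewards)

policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(), nn.Linear(32, 2))
behavior = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(), nn.Linear(32, 2))

# Hypothetical offline demonstrations from a sub-optimal behavior policy:
# the demonstrator moves right only 80% of the time.
demo_states = torch.randint(0, N_STATES - 1, (256,))
demo_actions = (torch.rand(256) < 0.8).long()

# Estimate the behavior policy from the demonstrations (behavior cloning).
bc_opt = torch.optim.Adam(behavior.parameters(), lr=1e-2)
for _ in range(200):
    logits = behavior(F.one_hot(demo_states, N_STATES).float())
    bc_opt.zero_grad()
    F.cross_entropy(logits, demo_actions).backward()
    bc_opt.step()

opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for it in range(300):
    # Policy-improvement step: REINFORCE on the sparse return.
    logps, rewards = rollout(policy)
    pg_loss = -(logps.sum() * rewards.sum())

    # Policy-guidance step: KL(pi || pi_b) on demonstration states,
    # weighted by a coefficient that decays as learning progresses.
    demo_x = F.one_hot(demo_states, N_STATES).float()
    kl = torch.distributions.kl_divergence(
        torch.distributions.Categorical(logits=policy(demo_x)),
        torch.distributions.Categorical(logits=behavior(demo_x).detach()),
    ).mean()
    delta = 1.0 / (1.0 + 0.05 * it)  # illustrative decay schedule

    opt.zero_grad()
    (pg_loss + delta * kl).backward()
    opt.step()
```

Early on, the guidance term dominates and steers exploration toward the demonstrator's "mostly right" behavior, which is what lets the agent reach the sparse reward at all; as the guidance weight decays, the policy is free to optimize the sparse return beyond the sub-optimal demonstrator.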
