Paper Title
Symbol Guided Hindsight Priors for Reward Learning from Human Preferences
Paper Authors
Paper Abstract
Specifying rewards for reinforcement learned (RL) agents is challenging. Preference-based RL (PbRL) mitigates these challenges by inferring a reward from feedback over sets of trajectories. However, the effectiveness of PbRL is limited by the amount of feedback needed to reliably recover the structure of the target reward. We present the PRIor Over Rewards (PRIOR) framework, which incorporates priors about the structure of the reward function and the preference feedback into the reward learning process. Imposing these priors as soft constraints on the reward learning objective reduces the amount of feedback required by half and improves overall reward recovery. Additionally, we demonstrate that using an abstract state space for the computation of the priors further improves the reward learning and the agent's performance.
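The abstract describes imposing reward priors as soft constraints on a preference-based reward learning objective. The sketch below illustrates one plausible form of such an objective, not the authors' implementation: a Bradley-Terry preference loss over trajectory segments plus a penalty that nudges the learned per-step reward toward a given prior estimate. The names `RewardNet`, `prior_reward`, and `prior_weight` are illustrative assumptions, and the prior values are assumed to be supplied externally (e.g., computed over an abstract state space).

```python
# Minimal sketch of preference-based reward learning with a prior as a soft constraint.
# Assumptions: a small MLP reward model, segment-level preference labels, and a
# per-step prior reward estimate provided for one segment.
from typing import Tuple

import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Maps a concatenated state-action vector to a scalar reward."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss_with_prior(
    reward_net: RewardNet,
    seg_a: Tuple[torch.Tensor, torch.Tensor],  # (obs, act) for segment A, each of shape (T, dim)
    seg_b: Tuple[torch.Tensor, torch.Tensor],  # (obs, act) for segment B
    label: torch.Tensor,                       # 1.0 if segment A is preferred, 0.0 if B is preferred
    prior_reward: torch.Tensor,                # per-step prior reward estimate for segment A (assumed given)
    prior_weight: float = 0.1,                 # strength of the soft constraint (hypothetical value)
) -> torch.Tensor:
    # Bradley-Terry style preference likelihood over summed segment returns.
    return_a = reward_net(*seg_a).sum()
    return_b = reward_net(*seg_b).sum()
    pref_loss = nn.functional.binary_cross_entropy_with_logits(return_a - return_b, label)

    # Soft constraint: penalize disagreement between the learned per-step reward
    # and the prior, rather than hard-coding the prior into the reward function.
    prior_penalty = nn.functional.mse_loss(reward_net(*seg_a), prior_reward)
    return pref_loss + prior_weight * prior_penalty
```

Because the prior enters only as a weighted penalty, preference feedback can still override it where the two disagree; `prior_weight` controls how strongly the learned reward is pulled toward the prior.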