Paper Title

Dynamics-Aware Comparison of Learned Reward Functions

Paper Authors

Blake Wulfe, Ashwin Balakrishna, Logan Ellis, Jean Mercat, Rowan McAllister, Adrien Gaidon

Paper Abstract

The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world. However, comparing reward functions, for example as a means of evaluating reward learning methods, presents a challenge. Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it. To address this challenge, Gleave et al. (2020) propose the Equivalent-Policy Invariant Comparison (EPIC) distance. EPIC avoids policy optimization, but in doing so requires computing reward values at transitions that may be impossible under the system dynamics. This is problematic for learned reward functions because it entails evaluating them outside of their training distribution, resulting in inaccurate reward values that we show can render EPIC ineffective at comparing rewards. To address this problem, we propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric. DARD uses an approximate transition model of the environment to transform reward functions into a form that allows for comparisons that are invariant to reward shaping while only evaluating reward functions on transitions close to their training distribution. Experiments in simulated physical domains demonstrate that DARD enables reliable reward comparisons without policy optimization and is significantly more predictive than baseline methods of downstream policy performance when dealing with learned reward functions.
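
To make the kind of comparison described above more concrete, here is a minimal sketch assuming an EPIC-style setup, not the paper's exact DARD formulation: each reward function is canonicalized against samples from fixed coverage distributions, and the two canonicalized rewards are then compared with the Pearson distance, which is invariant to positive rescaling and, after canonicalization, to potential-based shaping. The function names, the NumPy-based interface, and the toy rewards in the example are illustrative assumptions.

```python
import numpy as np


def canonicalize(reward_fn, s, a, s_next, samples, gamma=0.99):
    """EPIC-style canonicalization (Gleave et al., 2020):
    C(R)(s, a, s') = R(s, a, s')
        + E[gamma * R(s', A, S') - R(s, A, S') - gamma * R(S, A, S')],
    with S, A, S' drawn from fixed coverage distributions.
    Here `samples` is a sequence of (S, A, S') triples and `reward_fn(s, a, s')`
    returns a scalar; both conventions are illustrative assumptions."""
    t1 = np.mean([reward_fn(s_next, A, Sp) for (S, A, Sp) in samples])
    t2 = np.mean([reward_fn(s, A, Sp) for (S, A, Sp) in samples])
    t3 = np.mean([reward_fn(S, A, Sp) for (S, A, Sp) in samples])
    return reward_fn(s, a, s_next) + gamma * t1 - t2 - gamma * t3


def pearson_distance(x, y):
    """Pearson distance sqrt((1 - rho) / 2); invariant to positive affine
    rescaling, which (after canonicalization) makes the comparison
    insensitive to potential-based reward shaping."""
    rho = np.clip(np.corrcoef(x, y)[0, 1], -1.0, 1.0)
    return np.sqrt((1.0 - rho) / 2.0)


def reward_distance(reward_a, reward_b, transitions, samples, gamma=0.99):
    """Distance between two reward functions over a batch of (s, a, s')
    transitions. Evaluating only on transitions that are (approximately)
    feasible under the dynamics is the idea motivating DARD; the paper's exact
    DARD canonicalization, which uses an approximate transition model, differs
    from this EPIC-style sketch."""
    ca = np.array([canonicalize(reward_a, s, a, sn, samples, gamma)
                   for (s, a, sn) in transitions])
    cb = np.array([canonicalize(reward_b, s, a, sn, samples, gamma)
                   for (s, a, sn) in transitions])
    return pearson_distance(ca, cb)


if __name__ == "__main__":
    # Toy check with scalar states/actions: a reward and a potential-shaped
    # version of it should be (nearly) zero distance apart.
    rng = np.random.default_rng(0)
    gamma = 0.99
    base = lambda s, a, sp: -abs(sp)
    shaped = lambda s, a, sp: base(s, a, sp) + gamma * (-sp ** 2) - (-s ** 2)
    samples = [tuple(rng.normal(size=3)) for _ in range(256)]
    transitions = [tuple(rng.normal(size=3)) for _ in range(256)]
    print(reward_distance(base, shaped, transitions, samples, gamma))
```

In the toy example, the second reward differs from the first only by potential-based shaping, so after canonicalization they differ by a constant and the reported distance is approximately zero. Per the abstract, DARD's contribution is to obtain this shaping invariance while evaluating the reward functions only on transitions close to their training distribution, using an approximate transition model of the environment.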
