Paper title
Reward Learning with Trees: Methods and Evaluation
Paper authors
Abstract
Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.