Paper Title

Maximum Reward Formulation In Reinforcement Learning

Paper Authors

Sai Krishna Gottipati, Yashaswi Pathak, Rohan Nuttall, Sahir, Raviteja Chunduru, Ahmed Touati, Sriram Ganapathi Subramanian, Matthew E. Taylor, Sarath Chandar

Paper Abstract

Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline.
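
To make the change of objective concrete, here is a minimal sketch of the recursion the abstract describes; the exact functional form (in particular, where the discount factor enters) is an assumption here and may differ from the paper's. The standard Bellman optimality equation sums the immediate reward and the discounted future value,

    Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\, r(s,a,s') + \gamma \max_{a'} Q^*(s',a') \,\big],

whereas maximizing the expected maximum reward along a trajectory replaces the sum with a max,

    Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\, \max\big( r(s,a,s'),\; \gamma \max_{a'} Q^*(s',a') \big) \,\big].

Under this sketched form, the induced Bellman operator remains a \gamma-contraction in the sup norm, since \max(r, \cdot) is 1-Lipschitz; a contraction argument of this kind is the standard route to the convergence proof the abstract mentions.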
