Paper Title
On the Role of Discount Factor in Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement learning (RL) enables effective learning from previously collected data without exploration, which shows great promise in real-world applications where exploration is expensive or even infeasible. The discount factor, $γ$, plays a vital role in improving sample efficiency and estimation accuracy in online RL, but its role in offline RL is not well explored. This paper examines two distinct effects of $γ$ in offline RL through theoretical analysis, namely the regularization effect and the pessimism effect. On the one hand, $γ$ acts as a regulator that trades off optimality against sample efficiency when combined with existing offline techniques. On the other hand, a lower guidance $γ$ can also be seen as a form of pessimism in which we optimize the policy's performance under the worst possible models. We empirically verify these theoretical observations on tabular MDPs and standard D4RL tasks. The results show that the discount factor plays an essential role in the performance of offline RL algorithms, both in small-data regimes on top of existing offline methods and in large-data regimes without other conservative techniques.
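To make the guidance-discount idea concrete, the following is a minimal sketch, not the paper's code: it plans on an empirical tabular MDP with a lower guidance $γ$ and then evaluates the resulting greedy policy under the true evaluation $γ$. The MDP, its size, and all names (`P_hat`, `r_hat`, `gamma_guidance`, `gamma_eval`) are illustrative assumptions.

```python
# Sketch: planning with a lower "guidance" discount factor on an empirical
# tabular MDP estimated from offline data, then evaluating the greedy policy
# under the true discount factor. All quantities here are hypothetical.
import numpy as np

def value_iteration(P, r, gamma, n_iters=1000, tol=1e-8):
    """Q-value iteration on a tabular MDP.
    P: (S, A, S) transition probabilities, r: (S, A) rewards."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)                      # greedy state values
        Q_new = r + gamma * (P @ V)            # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q

def policy_value(P, r, pi, gamma):
    """Exact value of a deterministic policy pi (array of shape (S,))."""
    S, A, _ = P.shape
    P_pi = P[np.arange(S), pi]                 # (S, S) transitions under pi
    r_pi = r[np.arange(S), pi]                 # (S,) rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Hypothetical empirical MDP, standing in for one estimated from offline data.
rng = np.random.default_rng(0)
S, A = 10, 3
P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # empirical transitions
r_hat = rng.uniform(size=(S, A))                # empirical rewards

gamma_eval = 0.99      # true evaluation discount factor
gamma_guidance = 0.9   # lower guidance discount used only for planning

# Plan with the lower guidance gamma, then evaluate under the true gamma.
pi_guided = value_iteration(P_hat, r_hat, gamma_guidance).argmax(axis=1)
print(policy_value(P_hat, r_hat, pi_guided, gamma_eval).mean())
```

In this sketch the lower guidance discount only changes the planning objective; comparing the evaluated return of `pi_guided` against a policy planned with `gamma_eval` is one simple way to probe the regularization and pessimism effects the abstract describes.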