Paper Title

Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game

Paper Authors

Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Liwei Wang, Tong Zhang

Paper Abstract

Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose a new pessimism-based algorithm for offline linear MDPs. At the core of our algorithm is an uncertainty decomposition via a reference function, which is new to the offline RL literature under linear function approximation. Theoretical analysis demonstrates that our algorithm matches the performance lower bound up to logarithmic factors. We also extend our techniques to two-player zero-sum Markov games (MGs) and establish a new performance lower bound for MGs, which tightens the existing result and verifies the nearly minimax optimality of the proposed algorithm. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
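
For readers unfamiliar with pessimism in offline RL, below is a minimal sketch of generic pessimistic value iteration for linear MDPs: fit the Bellman target by ridge regression on the features and subtract an elliptical-norm bonus that penalizes poorly covered state-action pairs. This is only an illustration of the general principle named in the abstract, not the paper's algorithm; the reference-function-based uncertainty decomposition that yields the nearly minimax optimal rate is omitted, and the names `phi`, `beta`, `actions`, and the dataset layout are illustrative assumptions rather than the authors' notation.

```python
# A minimal sketch of generic pessimistic value iteration for offline linear
# MDPs (LCB-style). NOT the paper's algorithm: the reference-function-based
# uncertainty decomposition is omitted, and phi, beta, actions, and the
# dataset layout are illustrative assumptions.
import numpy as np

def pessimistic_value_iteration(dataset, phi, actions, H, d, beta, lam=1.0):
    """dataset[h]: list of offline transitions (s, a, r, s_next) at step h.
    phi(s, a): feature map into R^d.  actions: finite action set.
    Returns a list of greedy policies w.r.t. the pessimistic Q-estimates."""
    V_next = lambda s: 0.0              # value at step H is zero
    policy = [None] * H
    for h in reversed(range(H)):
        # Ridge regression of the Bellman target r + V_{h+1}(s') on phi(s, a).
        Lambda = lam * np.eye(d)
        target = np.zeros(d)
        for (s, a, r, s_next) in dataset[h]:
            x = phi(s, a)
            Lambda += np.outer(x, x)
            target += x * (r + V_next(s_next))
        w = np.linalg.solve(Lambda, target)
        Lambda_inv = np.linalg.inv(Lambda)

        def Q(s, a, w=w, Lambda_inv=Lambda_inv, h=h):
            x = phi(s, a)
            # Subtract the elliptical confidence width: pessimism under
            # limited data coverage.
            bonus = beta * np.sqrt(x @ Lambda_inv @ x)
            return float(np.clip(x @ w - bonus, 0.0, H - h))

        # Pessimistic value function and greedy policy at step h.
        V_next = lambda s, Q=Q: max(Q(s, a) for a in actions)
        policy[h] = lambda s, Q=Q: max(actions, key=lambda a: Q(s, a))
    return policy
```

The bonus term here is the standard elliptical confidence width; according to the abstract, the paper's sharper guarantee comes from decomposing this uncertainty with a reference function, a refinement this sketch does not attempt.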
