Paper Title

The Efficacy of Pessimism in Asynchronous Q-Learning

Paper Authors

Yuling Yan, Gen Li, Yuxin Chen, Jianqing Fan

Paper Abstract

This paper is concerned with the asynchronous form of Q-learning, which applies a stochastic approximation scheme to Markovian data samples. Motivated by the recent advances in offline reinforcement learning, we develop an algorithmic framework that incorporates the principle of pessimism into asynchronous Q-learning, which penalizes infrequently-visited state-action pairs based on suitable lower confidence bounds (LCBs). This framework leads to, among other things, improved sample efficiency and enhanced adaptivity in the presence of near-expert data. Our approach permits the observed data in some important scenarios to cover only partial state-action space, which is in stark contrast to prior theory that requires uniform coverage of all state-action pairs. When coupled with the idea of variance reduction, asynchronous Q-learning with LCB penalization achieves near-optimal sample complexity, provided that the target accuracy level is small enough. In comparison, prior works were suboptimal in terms of the dependency on the effective horizon even when i.i.d. sampling is permitted. Our results deliver the first theoretical support for the use of pessimism principle in the presence of Markovian non-i.i.d. data.
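To make the LCB-penalized update concrete, below is a minimal, hypothetical tabular sketch in Python. It assumes rewards in [0, 1], a single Markovian trajectory of (state, action, reward, next state) tuples collected by a behavior policy, and simplified placeholder choices for the learning rate and the visit-count-based penalty; it is not the exact algorithm, penalty, or constants analyzed in the paper, and it omits the variance-reduction component.

import numpy as np

def pessimistic_async_q_learning(trajectory, num_states, num_actions,
                                 gamma=0.99, c_b=1.0, delta=0.01):
    # Illustrative sketch of asynchronous Q-learning with an LCB-style penalty.
    # `trajectory` is a list of (state, action, reward, next_state) tuples
    # drawn from a single Markovian sample path (offline data).
    Q = np.zeros((num_states, num_actions))       # pessimistic Q estimate
    visits = np.zeros((num_states, num_actions))  # per-pair visit counts
    T = len(trajectory)

    for (s, a, r, s_next) in trajectory:
        visits[s, a] += 1
        n = visits[s, a]

        # Rescaled linear step size, a common choice for asynchronous Q-learning
        # (placeholder; not the schedule from the paper).
        eta = (1.0 / (1.0 - gamma) + 1.0) / (1.0 / (1.0 - gamma) + n)

        # LCB-style penalty: shrinks as (s, a) is visited more often, so
        # rarely visited pairs are pushed toward low (pessimistic) values.
        bonus = c_b * np.sqrt(np.log(T * num_states * num_actions / delta) / n) / (1.0 - gamma)

        target = r + gamma * Q[s_next].max() - bonus
        Q[s, a] = (1.0 - eta) * Q[s, a] + eta * target
        Q[s, a] = max(Q[s, a], 0.0)  # truncate at 0, since rewards lie in [0, 1]

    # Greedy policy with respect to the pessimistic Q estimate.
    policy = Q.argmax(axis=1)
    return Q, policy

Because the penalty never lets poorly covered state-action pairs look attractive, the returned greedy policy can be competitive whenever the behavior data adequately covers the actions taken by a good (e.g., near-expert) policy, without requiring coverage of the entire state-action space.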
