Paper Title
Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
Paper Authors
Qingfeng Lan, Yangchen Pan, Alona Fyshe, Martha White
Paper Abstract
Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called \emph{Maxmin Q-learning}, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
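To make the mechanism behind Maxmin Q-learning concrete, below is a minimal tabular sketch in Python/NumPy. It maintains N independent Q-tables, forms the bootstrap target from the elementwise minimum over the N estimates (larger N pushes the estimate from overestimation toward underestimation, and N = 1 recovers standard Q-learning), and updates one randomly chosen table per step. The function name maxmin_q_update and the default hyperparameters are illustrative, not taken from the paper.

    import numpy as np

    def maxmin_q_update(Qs, s, a, r, s_next, done, alpha=0.1, gamma=0.99, rng=None):
        # Qs: list of N Q-tables, each of shape (n_states, n_actions).
        rng = rng or np.random.default_rng()
        # Q^min at the next state: elementwise minimum over the N estimates.
        q_min_next = np.min([Q[s_next] for Q in Qs], axis=0)
        # Bootstrap from max_a' Q^min(s', a'); the min over estimates curbs
        # the overestimation bias of the usual max over a single estimate.
        target = r + (0.0 if done else gamma * np.max(q_min_next))
        # Update a single randomly selected estimate toward the shared target.
        i = rng.integers(len(Qs))
        Qs[i][s, a] += alpha * (target - Qs[i][s, a])

In the full algorithm, actions are also selected (e.g., epsilon-greedily) with respect to Q^min, and transitions are typically drawn from a replay buffer; both are omitted from this sketch.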