Paper Title
On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation
Paper Authors
Paper Abstract
We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast $Q$-function estimation as a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that, under one mild condition, the NPIV formulation of $Q$-function estimation is well-posed in the sense of the $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $γ$ imposed in the recent literature to obtain $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posedness property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of the $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on well-posedness and the minimax lower bounds are of independent interest for studying not only other nonparametric estimators of the $Q$-function but also efficient estimation of the value of any target policy in off-policy settings.
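To make the NPIV view in the abstract concrete, the following is a minimal sketch of a sieve two-stage least squares fit, not the authors' implementation. It assumes the Bellman equation is written as the conditional moment restriction $E[R - Q(S,A) + γ Q(S', π(S')) \mid S, A] = 0$, approximates $Q$ by a polynomial sieve, and uses a richer polynomial sieve in $(S, A)$ as instruments. The function names (`poly_basis`, `sieve_2sls_q`), the polynomial basis, the degrees, and the toy MDP are all assumptions chosen for illustration.

```python
import numpy as np

def poly_basis(s, a, degree):
    """Polynomial sieve: all monomials s^i * a^j with i + j <= degree."""
    cols = [s**i * a**j for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.column_stack(cols)

def sieve_2sls_q(s, a, r, s_next, a_next, gamma, deg_q=1, deg_iv=2):
    """Fit Q(s, a) ~ psi(s, a)' c by two-stage least squares.

    a_next holds the target policy's action at s_next (hypothetical input name).
    """
    psi = poly_basis(s, a, deg_q)                 # sieve for Q at (S, A)
    psi_next = poly_basis(s_next, a_next, deg_q)  # sieve for Q at (S', pi(S'))
    phi = psi - gamma * psi_next                  # plays the role of endogenous regressors
    b = poly_basis(s, a, deg_iv)                  # instrument sieve in (S, A)
    # Stage 1: project the regressors onto the instrument space.
    first_stage, *_ = np.linalg.lstsq(b, phi, rcond=None)
    phi_hat = b @ first_stage
    # Stage 2: regress rewards on the projected regressors.
    c, *_ = np.linalg.lstsq(phi_hat, r, rcond=None)

    def q_hat(s0, a0):
        s0 = np.atleast_1d(np.asarray(s0, dtype=float))
        a0 = np.atleast_1d(np.asarray(a0, dtype=float))
        return poly_basis(s0, a0, deg_q) @ c

    return q_hat

# Toy MDP (an assumption for illustration): S, A, S' are independent Uniform(0, 1),
# the reward is R = S + A, and the target policy is pi(s) = s.
# With gamma = 0.5 the true Q-function is Q(s, a) = s + a + gamma / (1 - gamma) = s + a + 1.
rng = np.random.default_rng(0)
n = 20000
s = rng.uniform(size=n)
a = rng.uniform(size=n)
r = s + a
s_next = rng.uniform(size=n)
a_next = s_next  # action of the target policy at the next state
q_hat = sieve_2sls_q(s, a, r, s_next, a_next, gamma=0.5)
```

In this toy setting the linear sieve contains the true $Q$-function, so `q_hat(0.2, 0.7)` should land near the closed-form value $0.2 + 0.7 + 1 = 1.9$, up to sampling error. When the instrument sieve equals the $Q$ sieve, the two-stage fit reduces to an ordinary least-squares solve of the same moment equations; the sieve dimensions here are fixed by hand rather than tuned with the sample size.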