Paper Title
Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity
Paper Authors
Paper Abstract
In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results through various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
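The abstract does not spell out the estimator, but for a finite controlled Markov chain the "natural nonparametric estimator" is ordinarily the empirical count-based one, P_hat(s' | s, a) = N(s, a, s') / N(s, a), computed from a single trajectory logged under the logging policy. The sketch below is a minimal illustration under that assumption; the function and variable names (estimate_transition_matrices, P_true, etc.) are hypothetical and not taken from the paper.

```python
import numpy as np


def estimate_transition_matrices(trajectory, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) from one logged trajectory.

    `trajectory` is a list of (state, action) pairs produced by running
    the logging policy; consecutive pairs give one observed transition.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for (s, a), (s_next, _) in zip(trajectory[:-1], trajectory[1:]):
        counts[s, a, s_next] += 1

    visits = counts.sum(axis=2, keepdims=True)            # N(s, a)
    # Unvisited (s, a) pairs get a uniform placeholder distribution.
    return np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_states)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, T = 3, 2, 50_000
    P_true = rng.dirichlet(np.ones(S), size=(S, A))       # ground-truth kernel
    traj, s = [], 0
    for _ in range(T):
        a = rng.integers(A)                               # uniform logging policy
        traj.append((s, a))
        s = rng.choice(S, p=P_true[s, a])
    P_hat = estimate_transition_matrices(traj, S, A)
    print("max |P_hat - P_true| =", np.abs(P_hat - P_true).max())
```

In this toy run the single trajectory mixes quickly under the uniform logging policy, so every (s, a) pair is visited often and the sup-norm error shrinks with T; the paper's bounds quantify how such estimation error depends on the logging policy's mixing properties and the sample size.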