Paper Title
An Adiabatic Theorem for Policy Tracking with TD-learning
Paper Authors
Paper Abstract
We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and $Q$-learning when the policy used for training changes in time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
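To make the setting concrete, below is a minimal illustrative sketch, not the paper's algorithm: tabular TD(0) tracking the value function of a policy whose induced Markov chain drifts slowly over time, the "adiabatic" regime the abstract refers to. The two-state chain, the interpolation schedule `policy_transition`, the reward vector, and the step size are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 2, 0.9, 0.1
V = np.zeros(n_states)  # tabular value estimates

def policy_transition(t, T=10_000):
    """Transition matrix of the Markov chain induced by the current policy.

    The policy drifts slowly with t (adiabatic change), interpolating
    between two fixed transition matrices P0 and P1. This schedule is an
    assumption of the sketch, not taken from the paper.
    """
    P0 = np.array([[0.9, 0.1], [0.1, 0.9]])
    P1 = np.array([[0.5, 0.5], [0.5, 0.5]])
    w = min(t / T, 1.0)  # slow interpolation parameter
    return (1 - w) * P0 + w * P1

reward = np.array([1.0, 0.0])  # illustrative state-dependent rewards

s = 0
for t in range(10_000):
    P = policy_transition(t)
    s_next = rng.choice(n_states, p=P[s])
    # Asynchronous TD(0) update: only the visited state's estimate changes,
    # matching the asynchronous adiabatic updates mentioned in the abstract.
    td_error = reward[s] + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    s = s_next

print(V)  # estimates that attempt to track the drifting policy's values
```

Because the policy changes slowly relative to the learning dynamics, the TD iterates can stay close to the moving target value function; the paper's finite-time bounds quantify this tracking error in terms of the drift rate and the mixing times of the time-inhomogeneous chain.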