Paper Title

MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer Sampling

Authors

Julius Ott, Lorenzo Servadei, Jose Arjona-Medina, Enrico Rinaldi, Gianfranco Mauro, Daniela Sánchez Lopera, Michael Stephan, Thomas Stadelmayer, Avik Santra, Robert Wille

Abstract

Data selection is essential for any data-based optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate uncertainty in the Q-Value estimation. Consequently, they cannot adapt the sampling strategies, including exploration and exploitation of transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. This is enabled by the uncertainty estimation of the Q-Value function, which guides the sampling to explore more significant transitions and, thus, learn a more efficient policy. Experiments on classical control environments demonstrate stable results across various environments. They show that the proposed method outperforms state-of-the-art sampling strategies for dense rewards w.r.t. convergence and peak performance by 26% on average.
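
The abstract gives no pseudo-code, so the following is a minimal, hypothetical sketch of what uncertainty-guided replay-buffer sampling of this kind could look like. The class name UncertaintyReplayBuffer, the trade-off weight beta, and the use of the standard deviation across a Monte Carlo ensemble of Q-value estimates as the uncertainty signal are illustrative assumptions, not the authors' exact MEET formulation.

# Hypothetical sketch of uncertainty-guided buffer sampling (not the exact MEET algorithm).
import numpy as np

class UncertaintyReplayBuffer:
    """Replay buffer whose sampling probabilities trade off exploitation
    (high mean Q-value estimate) against exploration (high disagreement
    across an ensemble of Q-value estimates for the stored transition)."""

    def __init__(self, capacity, beta=0.5):
        self.capacity = capacity   # maximum number of stored transitions
        self.beta = beta           # exploration/exploitation trade-off weight (assumed)
        self.transitions = []      # (state, action, reward, next_state, done) tuples
        self.q_ensembles = []      # per-transition array of ensemble Q-value estimates

    def add(self, transition, q_estimates):
        """Store a transition together with its ensemble of Q-value estimates
        (all ensembles are assumed to have the same fixed size)."""
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.q_ensembles.pop(0)
        self.transitions.append(transition)
        self.q_ensembles.append(np.asarray(q_estimates, dtype=np.float64))

    def sample(self, batch_size, rng=np.random):
        """Sample a batch with priority = (1 - beta) * mean(Q) + beta * std(Q)."""
        q = np.stack(self.q_ensembles)            # shape: (num_transitions, ensemble_size)
        exploit = q.mean(axis=1)                  # value-based (exploitation) score
        explore = q.std(axis=1)                   # uncertainty-based (exploration) score
        score = (1.0 - self.beta) * exploit + self.beta * explore
        score = score - score.min() + 1e-6        # shift to strictly positive priorities
        probs = score / score.sum()
        idx = rng.choice(len(self.transitions), size=batch_size, p=probs)
        return [self.transitions[i] for i in idx]

In this sketch, a beta near 1 makes the buffer explorative, favoring transitions where the Q-ensemble disagrees most, while a beta near 0 exploits transitions with high estimated value.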
