Paper Title

Optimistic Exploration even with a Pessimistic Initialisation

Authors

Tabish Rashid, Bei Peng, Wendelin Böhmer, Shimon Whiteson

Abstract

Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs.
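The abstract does not give the exact form of the count-derived bonus, but the tabular idea can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: it assumes a bonus of the form C / (N(s, a) + 1)^M with hyperparameters C and M, a Q-table pessimistically initialised at zero (appropriate when all rewards are non-negative), and applies the bonus both during action selection and when computing bootstrap targets, as the abstract describes.

```python
import numpy as np

# Illustrative sketch of count-based optimistic augmentation of pessimistically
# initialised Q-values, in the spirit of OPIQ. The bonus form C / (N + 1)**M is
# an assumption for this sketch; the abstract only states that count-derived
# bonuses are added during both action selection and bootstrapping.

n_states, n_actions = 10, 4
C, M, gamma, lr = 1.0, 2.0, 0.99, 0.1

Q = np.zeros((n_states, n_actions))   # pessimistic initialisation (rewards assumed >= 0)
N = np.zeros((n_states, n_actions))   # state-action visit counts

def q_plus(s):
    """Q-values for state s augmented with a count-derived optimism bonus."""
    return Q[s] + C / (N[s] + 1.0) ** M

def select_action(s):
    # Optimism during action selection: act greedily w.r.t. the augmented values.
    return int(np.argmax(q_plus(s)))

def update(s, a, r, s_next, done):
    # Optimism during bootstrapping: the target also uses the augmented values.
    target = r if done else r + gamma * np.max(q_plus(s_next))
    Q[s, a] += lr * (target - Q[s, a])
    N[s, a] += 1
```

In the deep RL setting described in the abstract, the table would be replaced by a DQN-style network and the exact counts by pseudocounts, with the same bonuses applied to the network's Q-value estimates.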
