Paper Title

Trust the Model When It Is Confident: Masked Model-based Actor-Critic

Paper Authors

Feiyang Pan, Jia He, Dandan Tu, Qing He

Paper Abstract

It is a popular belief that model-based Reinforcement Learning (RL) is more sample efficient than model-free RL, but in practice, it is not always true due to overweighed model errors. In complex and noisy settings, model-based RL tends to have trouble using the model if it does not know when to trust the model. In this work, we find that better model usage can make a huge difference. We show theoretically that if the use of model-generated data is restricted to state-action pairs where the model error is small, the performance gap between model and real rollouts can be reduced. It motivates us to use model rollouts only when the model is confident about its predictions. We propose Masked Model-based Actor-Critic (M2AC), a novel policy optimization algorithm that maximizes a model-based lower-bound of the true value function. M2AC implements a masking mechanism based on the model's uncertainty to decide whether its prediction should be used or not. Consequently, the new algorithm tends to give robust policy improvements. Experiments on continuous control benchmarks demonstrate that M2AC has strong performance even when using long model rollouts in very noisy environments, and it significantly outperforms previous state-of-the-art methods.
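To make the masking idea concrete, below is a minimal sketch (not the authors' implementation) of how an uncertainty-based mask of this kind could work: an ensemble of learned dynamics models scores each model-generated transition by prediction disagreement, and only the most confident fraction is kept for policy optimization. The disagreement measure, the `keep_ratio` parameter, and all function names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ensemble_disagreement(pred_next_states: np.ndarray) -> np.ndarray:
    """Per-transition uncertainty proxy: variance of next-state predictions
    across an ensemble of dynamics models (a common heuristic; the paper's
    exact uncertainty measure may differ).

    pred_next_states: shape (n_models, batch, state_dim).
    Returns: shape (batch,); higher means less confident.
    """
    mean_pred = pred_next_states.mean(axis=0)                    # (batch, state_dim)
    sq_dev = ((pred_next_states - mean_pred) ** 2).sum(axis=-1)  # (n_models, batch)
    return sq_dev.mean(axis=0)                                   # (batch,)

def mask_model_rollouts(transitions, uncertainty, keep_ratio=0.5):
    """Keep only the keep_ratio fraction of model-generated transitions with
    the lowest uncertainty and discard the rest ("trust the model when it is
    confident"). `transitions` is any indexable batch, e.g. a list of
    (s, a, r, s') tuples; `keep_ratio` is a hypothetical hyperparameter.
    """
    threshold = np.quantile(uncertainty, keep_ratio)
    keep = uncertainty <= threshold
    return [t for t, kept in zip(transitions, keep) if kept]

# Example: keep the 50% most confident of 4 simulated transitions,
# scored by a 3-model ensemble over a 2-dimensional state space.
preds = np.random.randn(3, 4, 2)
trans = [("s", "a", 0.0, "s'")] * 4
confident = mask_model_rollouts(trans, ensemble_disagreement(preds), keep_ratio=0.5)
```

Only the transitions that survive the mask would feed the actor-critic update, which is what lets such a method use long model rollouts without letting compounding model error dominate the value estimates.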
