Paper Title
Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization
Paper Authors
Paper Abstract
Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems: traveling salesman, capacitated vehicle routing, 0-1 knapsack, and job-shop scheduling.
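To make the population idea concrete, here is a minimal, hypothetical sketch of a winner-takes-all population update in the spirit the abstract describes: every member of the population is rolled out on the same instance, and only the member that produced the best solution receives a policy-gradient update. The toy problem, network sizes, reward function, and choice of baseline are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the paper's code): a winner-takes-all population
# update. Each of K policies samples a solution for the same instance; only
# the policy whose solution scored best receives a REINFORCE gradient, which
# lets members specialize on different kinds of instances without any
# hand-crafted diversity objective.

K, OBS_DIM, N_ACTIONS = 4, 8, 5  # hypothetical sizes for a toy problem

# Population of small, independent policy networks.
policies = nn.ModuleList(
    nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
    for _ in range(K)
)
opt = torch.optim.Adam(policies.parameters(), lr=1e-3)

def toy_reward(instance, action):
    # Stand-in for a combinatorial objective (e.g. negative tour length).
    return -(action.float() - instance.sum()).abs()

def winner_takes_all_step(instance):
    log_probs, rewards = [], []
    for policy in policies:                        # roll out every member
        dist = torch.distributions.Categorical(logits=policy(instance))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        rewards.append(toy_reward(instance, action))
    rewards = torch.stack(rewards)
    best = rewards.argmax()                        # winner takes the gradient
    # Baseline: best reward among the *other* members (one plausible choice).
    baseline = rewards[torch.arange(K) != best].max()
    loss = -(rewards[best] - baseline).detach() * log_probs[best]
    opt.zero_grad()
    loss.backward()
    opt.step()

winner_takes_all_step(torch.randn(OBS_DIM))
```

At inference, the same population can be rolled out in parallel on each instance and the best of the K solutions kept, which is why specialization across members translates directly into better population-level performance.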