Paper Title

Human-AI Coordination via Human-Regularized Search and Learning

Paper Authors

Hengyuan Hu, David J. Wu, Adam Lerer, Jakob Foerster, Noam Brown

Paper Abstract

We consider the problem of making AI agents that collaborate well with humans in partially observable, fully cooperative environments, given datasets of human behavior. Inspired by piKL, a human-data-regularized search method that improves upon a behavioral cloning policy without diverging far from it, we develop a three-step algorithm that achieves strong performance in coordinating with real humans in the Hanabi benchmark. We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels. Then, we integrate the policy-regularization idea into reinforcement learning to train a human-like best response to the human model. Finally, we apply regularized search on top of the best-response policy at test time to handle out-of-distribution challenges when playing with humans. We evaluate our method in two large-scale experiments with humans. First, we show that our method outperforms experts when playing with a group of diverse human players in ad-hoc teams. Second, we show that our method beats a vanilla best-response-to-behavioral-cloning baseline by having experts play repeatedly with the two agents.
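
The regularization idea shared by all three steps can be illustrated with a short sketch. At each decision point, a piKL-style regularized policy maximizes expected action value minus a KL penalty toward the behavioral-cloning (human) policy, and this one-step objective has a well-known closed form: pi(a) proportional to pi_bc(a) * exp(Q(a) / lambda). The code below is a minimal illustrative sketch, not the paper's actual implementation; the function name, the per-action values q_values, the anchor distribution pi_bc, and the temperature lam are all assumptions introduced here.

```python
import numpy as np

def pikl_action_distribution(q_values, pi_bc, lam):
    """Hypothetical sketch of a KL-regularized action distribution.

    Returns the maximizer of  E_pi[Q] - lam * KL(pi || pi_bc),
    which in closed form is  pi(a) ∝ pi_bc(a) * exp(Q(a) / lam).
    Large lam keeps the policy close to the human (BC) anchor;
    small lam makes it nearly greedy with respect to Q.
    """
    logits = np.log(np.asarray(pi_bc) + 1e-12) + np.asarray(q_values) / lam
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: search values favor action 2, but humans mostly play action 0.
q = np.array([1.0, 0.2, 1.4])
pi_bc = np.array([0.7, 0.2, 0.1])
print(pikl_action_distribution(q, pi_bc, lam=2.0))   # stays close to human play
print(pikl_action_distribution(q, pi_bc, lam=0.05))  # nearly greedy on Q
```

Under this reading, the temperature lam is the knob that trades off raw performance against predictability to human partners, which is why the same mechanism can serve for modeling humans, training a best response, and test-time search.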
