Paper Title

No-Regret Learning in Dynamic Stackelberg Games

Paper Authors

Niklas Lauffer, Mahsa Ghasemi, Abolfazl Hashemi, Yagiz Savas, Ufuk Topcu

Paper Abstract

In a Stackelberg game, a leader commits to a randomized strategy, and a follower chooses their best strategy in response. We consider an extension of a standard Stackelberg game, called a discrete-time dynamic Stackelberg game, that has an underlying state space that affects the leader's rewards and available strategies and evolves in a Markovian manner depending on both the leader and follower's selected strategies. Although standard Stackelberg games have been utilized to improve scheduling in security domains, their deployment is often limited by requiring complete information of the follower's utility function. In contrast, we consider scenarios where the follower's utility function is unknown to the leader; however, it can be linearly parameterized. Our objective then is to provide an algorithm that prescribes a randomized strategy to the leader at each step of the game based on observations of how the follower responded in previous steps. We design a no-regret learning algorithm that, with high probability, achieves a regret bound (when compared to the best policy in hindsight) which is sublinear in the number of time steps; the degree of sublinearity depends on the number of features representing the follower's utility function. The regret of the proposed learning algorithm is independent of the size of the state space and polynomial in the rest of the parameters of the game. We show that the proposed learning algorithm outperforms existing model-free reinforcement learning approaches.
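
The two quantities at the core of the abstract, the linearly parameterized follower utility and the regret notion, can be made concrete with a minimal sketch. The notation below (state s_t, leader strategy x_t, follower response a_t, leader reward r, feature map \phi, unknown weight vector \theta^*, horizon T) is assumed for illustration and is not taken from the paper:

  u_f(s, x, a) = \langle \theta^*, \phi(s, x, a) \rangle, \qquad a_t \in \arg\max_{a} u_f(s_t, x_t, a),

so the leader never observes \theta^* directly and can only infer it from the follower's observed best responses. Measured against the best policy in hindsight, the regret after T steps would read

  R_T = \max_{\pi} \sum_{t=1}^{T} r\big(s_t^{\pi}, \pi(s_t^{\pi}), a_t^{\pi}\big) \;-\; \sum_{t=1}^{T} r\big(s_t, x_t, a_t\big),

and the guarantee described above is that, with high probability, R_T grows sublinearly in T, with the degree of sublinearity governed by the dimension of \phi rather than by the size of the state space.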
