论文标题
具有安全约束的分散多代理线性匪徒
Decentralized Multi-Agent Linear Bandits with Safety Constraints
论文作者
论文摘要
我们研究了分散的随机线性土匪,其中$ n $ agents的网络合作,在$ d $维空间上有效地解决了线性匪徒优化问题。对于这个问题,我们提出了DLUCB:一种完全分散的算法,可最大程度地减少整个网络的累积后悔。在每个回合算法上,每个代理都按照上限(UCB)策略选择其动作,而代理通过精心设计的共识程序与直接邻居共享信息,该程序会重复进行周期。我们的分析调整了这些通信周期的持续时间,以确保以$ \ MATHCAL {O}(O}的通信速率(dn^2)$每回合,我们每轮的通信速率均为近距离的遗憾性能$ \ MATHCAL {o}(d \ log {nt} \ sqrt {nt})$。网络的结构通过一个小的添加剂术语(造成延迟的遗憾)影响了遗憾的性能,这取决于基础图的光谱差距。值得注意的是,我们的结果适用于任意网络拓扑,而无需专用代理充当服务器。考虑到沟通成本较高的情况,我们提出了RC-DLUCB:DLUCB的修改,代理商之间的沟通很少。新算法将遗憾的性能与所有$ t $ rounds上的$ \ mathcal {o}(d^3n^{2.5})$大幅降低的总沟通成本进行了交流。最后,我们表明我们的想法自然而然地扩展到了新兴的,尽管更具挑战性的安全匪徒。对于最近研究的线性匪徒具有未知线性安全约束的问题,我们提出了第一个安全的分散算法。我们的研究有助于在安全至关重要的分布式系统中应用匪徒技术,这些系统反复处理未知的随机环境。我们为各种网络拓扑提供了数值模拟,以证实我们的理论发现。
We study decentralized stochastic linear bandits, where a network of $N$ agents acts cooperatively to efficiently solve a linear bandit-optimization problem over a $d$-dimensional space. For this problem, we propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network. At each round of the algorithm each agent chooses its actions following an upper confidence bound (UCB) strategy and agents share information with their immediate neighbors through a carefully designed consensus procedure that repeats over cycles. Our analysis adjusts the duration of these communication cycles ensuring near-optimal regret performance $\mathcal{O}(d\log{NT}\sqrt{NT})$ at a communication rate of $\mathcal{O}(dN^2)$ per round. The structure of the network affects the regret performance via a small additive term - coined the regret of delay - that depends on the spectral gap of the underlying graph. Notably, our results apply to arbitrary network topologies without a requirement for a dedicated agent acting as a server. In consideration of situations with high communication cost, we propose RC-DLUCB: a modification of DLUCB with rare communication among agents. The new algorithm trades off regret performance for a significantly reduced total communication cost of $\mathcal{O}(d^3N^{2.5})$ over all $T$ rounds. Finally, we show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits. For the recently studied problem of linear bandits with unknown linear safety constraints, we propose the first safe decentralized algorithm. Our study contributes towards applying bandit techniques in safety-critical distributed systems that repeatedly deal with unknown stochastic environments. We present numerical simulations for various network topologies that corroborate our theoretical findings.