Paper Title


Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds

Authors

Aritra Mitra, Arman Adibi, George J. Pappas, Hamed Hassani

Abstract


We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\left(\alpha + 1/\sqrt{M}\right)\sqrt{dT}\right)$ regret for each good agent; here, $d$ is the model dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.
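The core idea the abstract describes — good agents report reward estimates to a server, the server aggregates them with a robust statistic, and clearly suboptimal arms are eliminated phase by phase — can be illustrated with a toy simulation. This is only a minimal sketch: the trimmed-mean aggregator, the heuristic confidence-width schedule, and all function names below are illustrative assumptions, not the paper's actual algorithm or its robust confidence intervals.

```python
import numpy as np

def trimmed_mean(values, frac):
    """Coordinate-wise trimmed mean: sort each column across agents and drop
    the top and bottom frac-fraction of reports before averaging. This is a
    standard robust aggregator that tolerates a small fraction of outliers."""
    values = np.sort(np.asarray(values), axis=0)
    k = int(np.ceil(frac * len(values)))
    kept = values[k:len(values) - k] if len(values) - 2 * k > 0 else values
    return kept.mean(axis=0)

def robust_phased_elimination(arms, theta_star, m, alpha, phases,
                              pulls_per_phase, noise=0.1, rng=None):
    """Toy robust collaborative phased elimination over a finite arm set.
    Each phase: every agent reports an empirical mean reward per active arm;
    a fraction alpha of agents report arbitrary values instead. The server
    aggregates with a trimmed mean and eliminates arms whose estimate falls
    well below the best one. Returns the indices of surviving arms."""
    rng = np.random.default_rng(rng)
    active = list(range(len(arms)))
    n_bad = int(alpha * m)                       # agents 0..n_bad-1 are adversarial
    for p in range(phases):
        n = pulls_per_phase * (2 ** p)           # pull counts double each phase
        width = 2.0 / np.sqrt(n * m)             # heuristic shrinking confidence width
        reports = np.empty((m, len(active)))
        for a in range(m):
            for j, i in enumerate(active):
                mean = arms[i] @ theta_star      # true linear reward of arm i
                if a < n_bad:
                    reports[a, j] = rng.uniform(-1, 1)   # arbitrary adversarial report
                else:
                    reports[a, j] = mean + noise * rng.standard_normal() / np.sqrt(n)
        est = trimmed_mean(reports, alpha + 0.01)        # trim slightly more than alpha
        best = est.max()
        active = [i for j, i in enumerate(active) if est[j] >= best - 2 * width]
    return active
```

With a small arm set whose best arm is clearly separated, the trimmed mean discards the adversarial reports and the suboptimal arms are eliminated within a few phases; without the trimming step, even a single agent reporting extreme values could keep bad arms alive or kill the best one.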
