Paper Title


Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds

Authors

Aritra Mitra, Arman Adibi, George J. Pappas, Hamed Hassani

Abstract


We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\left(\alpha + 1/\sqrt{M}\right)\sqrt{dT}\right)$ regret for each good agent; here, $d$ is the model dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.
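The core idea the abstract describes — good agents report reward estimates to a server, the server aggregates them with a robust statistic, and clearly suboptimal arms are eliminated phase by phase — can be illustrated with a toy simulation. This is only a minimal sketch: the trimmed-mean aggregator, the heuristic confidence-width schedule, and all function names below are illustrative assumptions, not the paper's actual algorithm or its robust confidence intervals.

```python
import numpy as np

def trimmed_mean(values, frac):
    """Coordinate-wise trimmed mean: sort each column across agents and drop
    the top and bottom frac-fraction of reports before averaging. This is a
    standard robust aggregator that tolerates a small fraction of outliers."""
    values = np.sort(np.asarray(values), axis=0)
    k = int(np.ceil(frac * len(values)))
    kept = values[k:len(values) - k] if len(values) - 2 * k > 0 else values
    return kept.mean(axis=0)

def robust_phased_elimination(arms, theta_star, m, alpha, phases,
                              pulls_per_phase, noise=0.1, rng=None):
    """Toy robust collaborative phased elimination over a finite arm set.
    Each phase: every agent reports an empirical mean reward per active arm;
    a fraction alpha of agents report arbitrary values instead. The server
    aggregates with a trimmed mean and eliminates arms whose estimate falls
    well below the best one. Returns the indices of surviving arms."""
    rng = np.random.default_rng(rng)
    active = list(range(len(arms)))
    n_bad = int(alpha * m)                       # agents 0..n_bad-1 are adversarial
    for p in range(phases):
        n = pulls_per_phase * (2 ** p)           # pull counts double each phase
        width = 2.0 / np.sqrt(n * m)             # heuristic shrinking confidence width
        reports = np.empty((m, len(active)))
        for a in range(m):
            for j, i in enumerate(active):
                mean = arms[i] @ theta_star      # true linear reward of arm i
                if a < n_bad:
                    reports[a, j] = rng.uniform(-1, 1)   # arbitrary adversarial report
                else:
                    reports[a, j] = mean + noise * rng.standard_normal() / np.sqrt(n)
        est = trimmed_mean(reports, alpha + 0.01)        # trim slightly more than alpha
        best = est.max()
        active = [i for j, i in enumerate(active) if est[j] >= best - 2 * width]
    return active
```

With a small arm set whose best arm is clearly separated, the trimmed mean discards the adversarial reports and the suboptimal arms are eliminated within a few phases; without the trimming step, even a single agent reporting extreme values could keep bad arms alive or kill the best one.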
