Paper Title
Meta-Learning Adversarial Bandits
Paper Authors
Paper Abstract
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a unified meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-algorithm tunes the initialization, step-size, and entropy parameter of the Tsallis-entropy generalization of the well-known Exp3 method, with the task-averaged regret provably improving if the entropy of the distribution over estimated optima-in-hindsight is small. For BLO, we learn the initialization, step-size, and boundary-offset of online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with a measure induced by these functions on the interior of the action space. Our adaptive guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.
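To make the MAB base learner concrete, the sketch below gives a minimal single-task implementation of an Exp3-style bandit algorithm run as online mirror descent over the simplex with a Tsallis-entropy regularizer. The `init`, `step_size`, and `beta` arguments correspond to the initialization, step-size, and entropy parameter that the abstract says the meta-algorithm tunes across tasks; the meta-learning layer, the BLO case, and all theoretical tuning are omitted. The function names, signatures, and the binary-search solver for the simplex constraint are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def tsallis_omd_bandit(loss_fn, n_arms, horizon, init=None, step_size=0.1, beta=0.5, rng=None):
    """One task of a Tsallis-entropy (Exp3-style) bandit base learner (illustrative sketch).

    Online mirror descent on the probability simplex with regularizer
    psi(x) = (1 - sum_i x_i**beta) / (1 - beta), beta in (0, 1); the limit
    beta -> 1 recovers the negative Shannon entropy used by standard Exp3.

    `init`, `step_size`, and `beta` are the three knobs the paper's
    meta-algorithm would tune across tasks; here they are plain arguments.
    `loss_fn(t, arm)` is a hypothetical oracle returning the loss in [0, 1]
    of the played arm at round t (bandit feedback only).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.full(n_arms, 1.0 / n_arms) if init is None else np.asarray(init, dtype=float)
    total_loss = 0.0
    for t in range(horizon):
        arm = rng.choice(n_arms, p=x)
        loss = loss_fn(t, arm)                      # only the played arm's loss is observed
        total_loss += loss
        loss_est = np.zeros(n_arms)
        loss_est[arm] = loss / x[arm]               # importance-weighted loss estimate
        x = _omd_step(x, loss_est, step_size, beta)
    return total_loss

def _omd_step(x, loss_est, eta, beta, tol=1e-10):
    """Mirror step: argmin_y <loss_est, y> + (1/eta) * D_psi(y, x) over the simplex.

    The first-order conditions give
        x_new_i = (x_i**(beta-1) + (1-beta)/beta * (eta*loss_est_i - lam))**(-1/(1-beta)),
    where lam is the simplex Lagrange multiplier, found by binary search so
    that the coordinates sum to one (nonnegativity holds automatically).
    """
    c = (1.0 - beta) / beta
    base = x ** (beta - 1.0) + c * eta * loss_est
    lam_lo, lam_hi = 0.0, base.min() / c            # sum <= 1 at lam_lo; sum -> infinity at lam_hi
    for _ in range(100):
        lam = 0.5 * (lam_lo + lam_hi)
        x_new = (base - c * lam) ** (-1.0 / (1.0 - beta))
        if x_new.sum() > 1.0:
            lam_hi = lam
        else:
            lam_lo = lam
        if lam_hi - lam_lo < tol:
            break
    x_new = (base - c * lam_lo) ** (-1.0 / (1.0 - beta))
    return x_new / x_new.sum()                      # renormalize away binary-search slack
```

Written in this mirror-descent form, the learned initialization enters only through the starting distribution, which is what would let a meta-learner bias the base algorithm toward arms that were (estimated) optima-in-hindsight on earlier tasks, in line with the entropy-based task-similarity measure described in the abstract.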