Sequential Multi-hypothesis Testing in Multi-armed Bandit Problems: An Approach for Asymptotic Optimality

Paper Title

Sequential Multi-hypothesis Testing in Multi-armed Bandit Problems: An Approach for Asymptotic Optimality

Authors

Prabhu, Gayathri R; Bhashyam, Srikrishna; Gopalan, Aditya; Sundaresan, Rajesh

Abstract

We consider a multi-hypothesis testing problem involving a K-armed bandit. Each arm's signal follows a distribution from a vector exponential family. The actual parameters of the arms are unknown to the decision maker. The decision maker incurs a delay cost for delay until a decision and a switching cost whenever he switches from one arm to another. His goal is to minimise the overall cost until a decision is reached on the true hypothesis. Of interest are policies that satisfy a given constraint on the probability of false detection. This is a sequential decision making problem where the decision maker gets only a limited view of the true state of nature at each stage, but can control his view by choosing the arm to observe at each stage. An information-theoretic lower bound on the total cost (expected time for a reliable decision plus total switching cost) is first identified, and a variation on a sequential policy based on the generalised likelihood ratio statistic is then studied. Due to the vector exponential family assumption, the signal processing at each stage is simple; the associated conjugate prior distribution on the unknown model parameters enables easy updates of the posterior distribution. The proposed policy, with a suitable threshold for stopping, is shown to satisfy the given constraint on the probability of false detection. Under a continuous selection assumption, the policy is also shown to be asymptotically optimal in terms of the total cost among all policies that satisfy the constraint on the probability of false detection.
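The abstract notes that a conjugate prior makes posterior updates easy under the exponential family assumption. As a minimal illustration of that one point only, the sketch below assumes Bernoulli arms with Beta priors, a scalar special case chosen here for simplicity. The names `BetaPosterior` and `update_arm` are hypothetical, and the sketch does not implement the paper's sampling or stopping policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) belief about one arm's Bernoulli parameter.

    Assumption for illustration: Bernoulli arms with a Beta conjugate prior.
    The paper treats a general vector exponential family.
    """
    alpha: float = 1.0  # prior pseudo-count of ones (uniform prior by default)
    beta: float = 1.0   # prior pseudo-count of zeros

    def update(self, observation: int) -> "BetaPosterior":
        # Conjugacy: the posterior is again a Beta distribution,
        # obtained by adding the observation to the sufficient statistics.
        if observation not in (0, 1):
            raise ValueError("observation must be 0 or 1")
        return BetaPosterior(self.alpha + observation,
                             self.beta + 1 - observation)

    @property
    def mean(self) -> float:
        # Posterior mean of the arm's success probability.
        return self.alpha / (self.alpha + self.beta)


def update_arm(posteriors: List[BetaPosterior], arm: int,
               observation: int) -> List[BetaPosterior]:
    """Return a new list in which only the sampled arm's posterior changed."""
    updated = list(posteriors)
    updated[arm] = updated[arm].update(observation)
    return updated


if __name__ == "__main__":
    beliefs = [BetaPosterior() for _ in range(3)]  # K = 3 arms
    for obs in (1, 0, 1):                          # three pulls of arm 1
        beliefs = update_arm(beliefs, 1, obs)
    print(beliefs[1], beliefs[1].mean)
```

Only the arm that was observed changes, which reflects the limited view the decision maker has at each stage; the other arms' posteriors stay as they were.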
