多项式logit上下文强盗的可访问在线学习算法

论文标题

多项式logit上下文强盗的可访问在线学习算法

A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit

论文作者

Agrawal, Priyank, Tulabandhula, Theja, Avadhanula, Vashist

论文摘要

在本文中，我们考虑了MNL-Bandit问题的上下文变体。更具体地说，我们考虑了一个动态设定的优化问题，决策者为消费者提供了一个子集（各种产品），并在每回合中观察响应。消费者购买产品以最大化其实用性。我们假设一组属性描述了产品，并且产品的平均效用在这些属性的值中是线性的。我们使用广泛使用的多项式logit（MNL）模型对消费者选择行为进行建模，并考虑动态学习模型参数的决策者问题，同时优化累计收入在销售范围内$ t $。尽管最近这个问题引起了人们的关注，但许多现有方法通常涉及解决棘手的非凸优化问题。他们的理论绩效保证取决于问题依赖性参数，该参数可能非常大。特别是，此问题的现有算法对$ O（\ sqrt {κDt}）$界后后悔，其中$κ$是问题依赖性常数，可以对属性数量具有指数依赖性。在本文中，我们提出了一种乐观的算法，并表明遗憾由$ o（\ sqrt {dt} +κ）$界定，从而大大提高了现有方法的性能。此外，我们提出了对优化步骤的放松，该步骤允许进行可牵引的决策，同时保留良好的遗憾保证。

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describe the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision maker problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods often involve solving an intractable non-convex optimization problem. Their theoretical performance guarantees depend on a problem-dependent parameter which could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{κd T})$, where $κ$ is a problem-dependent constant that can have an exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(\sqrt{dT} + κ)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee.

下载PDF全文

下载文献需遵守相关版权规定

论文标题