Paper Title
Risk-Averse No-Regret Learning in Online Convex Games
Paper Authors
Paper Abstract
We consider an online stochastic game with risk-averse agents whose goal is to learn optimal decisions that minimize the risk of incurring significantly high costs. Specifically, we use the Conditional Value at Risk (CVaR) as a risk measure that the agents can estimate using bandit feedback in the form of the cost values of only their selected actions. Since the distributions of the cost functions depend on the actions of all agents that are generally unobservable, they are themselves unknown and, therefore, the CVaR values of the costs are difficult to compute. To address this challenge, we propose a new online risk-averse learning algorithm that relies on one-point zeroth-order estimation of the CVaR gradients computed using CVaR values that are estimated by appropriately sampling the cost functions. We show that this algorithm achieves sub-linear regret with high probability. We also propose two variants of this algorithm that improve performance. The first variant relies on a new sampling strategy that uses samples from the previous iteration to improve the estimation accuracy of the CVaR values. The second variant employs residual feedback that uses CVaR values from the previous iteration to reduce the variance of the CVaR gradient estimates. We theoretically analyze the convergence properties of these variants and illustrate their performance on an online market problem that we model as a Cournot game.
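For reference, a minimal statement of the risk measure the abstract refers to, assuming the standard Rockafellar-Uryasev formulation of CVaR at tail level \(\alpha \in (0,1)\) (the notation here is generic, not taken from the paper itself):

```latex
\mathrm{CVaR}_{\alpha}(C) \;=\; \min_{t \in \mathbb{R}}
  \left\{\, t + \frac{1}{\alpha}\,\mathbb{E}\big[(C - t)_{+}\big] \,\right\},
\qquad (z)_{+} := \max(z, 0),
```

where \(C\) is the random cost of the played action; the minimizer \(t^{*}\) is the corresponding Value at Risk. Intuitively, \(\mathrm{CVaR}_{\alpha}(C)\) is the expected cost conditioned on being in the worst \(\alpha\)-fraction of outcomes, which is why minimizing it captures aversion to rare but significantly high costs.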
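The following is a minimal sketch, not the paper's implementation, of the one-point zeroth-order scheme the abstract describes: the agent plays a randomly perturbed action, estimates the CVaR of its cost from bandit samples, and scales the perturbation direction by that estimate. The oracle name `sample_costs` and all parameter names are assumptions introduced for illustration.

```python
import numpy as np

def empirical_cvar(costs, alpha):
    """Empirical CVaR_alpha: average of the worst alpha-fraction of cost samples."""
    costs = np.sort(np.asarray(costs, dtype=float))
    k = max(1, int(np.ceil(alpha * costs.size)))
    return costs[-k:].mean()

def one_point_cvar_grad(sample_costs, x, alpha, delta, n_samples, rng):
    """One-point zeroth-order estimate of the CVaR gradient at action x.

    `sample_costs(action, n)` is a hypothetical oracle returning n i.i.d.
    bandit-feedback cost values for the played action; its distribution
    implicitly depends on the other agents' unobserved actions, which is
    why only sampling (and no closed-form CVaR) is available.
    """
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                        # random direction on the unit sphere
    cvar = empirical_cvar(sample_costs(x + delta * u, n_samples), alpha)
    grad = (d / delta) * cvar * u                 # one-point gradient estimate
    return grad, cvar, u
```

In an online loop, each agent would play the perturbed action `x + delta * u`, form this estimate, and take a projected step `x = project(x - eta * grad)` onto its feasible action set.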
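The residual-feedback variant mentioned in the abstract can be sketched, under the same assumed names, by replacing the raw CVaR value in the estimator with the difference of CVaR estimates from two consecutive iterations; once the iterates stabilize this difference is small, which shrinks the variance of the gradient estimate:

```python
def residual_cvar_grad(cvar_now, cvar_prev, u, d, delta):
    """Residual-feedback variant of the one-point estimator: scale the current
    perturbation direction u by the change in estimated CVaR between the
    current and previous iterations, rather than by the raw CVaR value."""
    return (d / delta) * (cvar_now - cvar_prev) * u
```

Here `cvar_prev` is the CVaR estimate computed at the previous perturbed action; the caller stores it between iterations.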