Paper Title
Quantile Off-Policy Evaluation via Deep Conditional Generative Learning
Paper Authors
Paper Abstract
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy. It is critical in a number of sequential decision-making problems, ranging from healthcare to technology industries. Most existing work focuses on evaluating the mean outcome of a given policy and ignores the variability of the outcome. However, in a variety of applications, criteria other than the mean may be more sensible. For example, when the reward distribution is skewed and asymmetric, quantile-based metrics are often preferred for their robustness. In this paper, we propose a doubly robust inference procedure for quantile OPE in sequential decision making and study its asymptotic properties. In particular, we propose utilizing state-of-the-art deep conditional generative learning methods to handle parameter-dependent nuisance function estimation. We demonstrate the advantages of the proposed estimator through both simulations and a real-world dataset from a short-video platform. In particular, we find that our proposed estimator outperforms classical OPE estimators for the mean in settings with heavy-tailed reward distributions.
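To make the doubly robust quantile idea concrete, below is a minimal Python sketch in a one-step (contextual-bandit style) simplification; it is an illustration under stated assumptions, not the paper's estimator. The function name `dr_quantile_ope`, the synthetic heavy-tailed data, and the stand-in importance ratios and generative-model draws are all hypothetical; in the paper's setting the conditional CDF nuisance would be approximated with samples from a fitted deep conditional generative model, and the sequential structure adds further correction terms. The estimate solves the estimating equation E[F̂(q | s) + w · (1{r ≤ q} − F̂(q | s))] = τ for q, so it remains consistent if either the importance ratios or the conditional CDF model is correct.

import numpy as np
from scipy.optimize import brentq

def dr_quantile_ope(rewards, is_weights, cdf_samples, tau):
    """Doubly robust quantile estimate (one-step simplification).

    rewards     : (n,) observed rewards under the behavior policy
    is_weights  : (n,) importance ratios pi_target / pi_behavior
    cdf_samples : (n, m) Monte Carlo reward draws per context from a
                  fitted conditional generative model (the nuisance)
    tau         : quantile level in (0, 1)
    """
    def estimating_eq(q):
        # Model-based conditional CDF term, approximated by the
        # generative samples for each context.
        f_hat = (cdf_samples <= q).mean(axis=1)
        # Importance-weighted residual corrects model misspecification.
        resid = is_weights * ((rewards <= q).astype(float) - f_hat)
        return np.mean(f_hat + resid) - tau

    # Bracket wide enough that the estimating function changes sign.
    lo = min(rewards.min(), cdf_samples.min()) - 1.0
    hi = max(rewards.max(), cdf_samples.max()) + 1.0
    return brentq(estimating_eq, lo, hi)

# Toy usage with synthetic heavy-tailed rewards (all values hypothetical).
rng = np.random.default_rng(0)
n, m = 500, 200
rewards = rng.standard_t(df=2, size=n)           # heavy-tailed rewards
is_weights = rng.uniform(0.5, 1.5, size=n)       # stand-in importance ratios
cdf_samples = rng.standard_t(df=2, size=(n, m))  # stand-in generative draws
print(dr_quantile_ope(rewards, is_weights, cdf_samples, tau=0.5))

The heavy-tailed t-distributed rewards in the toy example mirror the setting where the abstract reports quantile OPE being preferable to mean-based OPE: the median estimating equation above stays stable even when the mean of the reward distribution is poorly behaved.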